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Abstract 


Four algorithms are presented for computing symbolic 
diagonal Pade approximants for a power series whose 
coefficients lie in a field. Precise bounds are developed 
for the worst case cost of each of the four algorithms using 
both classical and fast power series and polynomial 
arithmetic. Following this, empirical CPU timing results, as 
obtained for each of the algorithms, are given. Based on 
both the theoretical and empirical costs produced, it is 
shown that one method, using classical arithmetic, is 
Superior for obtaining approximants of low degree and that a 
different method, using fast arithmetic, becomes superior 
for larger problems. Estimates for the crossover in 


Superiority between the two methods are obtained. 
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1. Introduction 

Over the last few years, a renewed interest has been 
expressed in the computation of Pade approximants for a 
power series, which represents one approach to converting a 
power series to rational form, as outlined shortly. Pade 
approximants have held a long and established place in 
mathematical analysis and theoretical physics, as indicated 
by Gragg [12] and Geddes [10,11]. However, this lastest 
interest has been generated from within the study of 
Symbolic computer algebra by the desire to be able to 
efficiently manipulate rational functions. 

The most straightforward approach for dealing with 
rational functions, that of retaining both the numerator and 
the denominator as separate entities, 1s susceptible to 
unpredictable growth in the numerator and the denominator. 
In particular, performing arithmetic on rational functions 
has a tendency to dramatically increase the sizes (in terms 
of their degrees) of the numerator and denominator of the 
result. This makes it necessary either to retain and work 
with undesirably large values, or to continually compute 
polynomial greatest common divisors in order to keep the 
rational function in reduced form. For some (perhaps all) 
applications, a possible alternative is available in the 
form of truncated power series. 

The idea behind the use of power series for this 
purpose is that the rational function is first converted to 


a truncated power series of a predetermined number of terms. 
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The desired calculations are then performed on this power 
series with an appropriate number of terms being retained. 
Finally, the resulting power series is (are) converted back 
into rational form. 

The transformation from rational form to truncated 
power series form is relatively easy to perform, for 
example, by applying power series division (described in 
Chapter 2) to the numerator and denominator polynomials. 
However, the reverse transformation iS not as easy to 
accomplish. It is in this area that the usefulness of Pade 
approximants becomes apparent. To date, Pade approximation 
techniques represent the most promising method for 
converting a rational function from truncated power series 
form sbackieto sratrvonalriorm: 

The recent work on Pade approximants has resulted in 
the development of three fast algorithms that reduce the 
complexity of computing diagonal Pade approximants (defined 
in the next section) to O(nlog“n) from the previous best of 
0(n*) as achieved by the classical algorithms. It is the 
purpose of the present research to perform a comparative 
evaluation of these three algorithms along with the best of 
the classical methods. The analysis is carried out at both 
the abstract and the practical level, but it is the 
practical results which form the major contribution of this 
study. Although two of the algorithms are designed to handle 
more general cases, and are presented in their more general 


form, only the computation of Pade approximants which lie 
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along the main diagonal of the Pade table is considered. 
This 1S not limiting in any practical sense since it is 
shown that algorithms for computing diagonal approximants 
can easily be used to obtain any arbitrary Pade approximant 
for a power series. A further limitation imposed in the 
present study 1S that all polynomials and power series are 
assumed to be univariate with coefficients lying ina field. 

No mention is made throughout this presentation of any 
classical method for computing Pade approximants other than 
the one mentioned above. The reason for this is twofold. To 
begin with, the classical method that is included herein is 
2/3 the cost of the next best classical method. Furthermore, 
almost all of the other classical algorithms fail for power 
series that are abnormal (as defined in the next section). 
Details of these other approaches can be found in Gragg [12] 
and Geddes [10,11]. 

Throughout this presentation, some familiarity with the 
basic concepts of abstract algebra is assumed. Should such 
knowledge be lacking, it is advised that a text on abstract 
algebra, such as Herstein [13] or Lipson [19], be consulted 
to obtain a rudimentary background in the preliminary 
aspects of ring and field theory. 

An effort has been made to keep this discussion as 
complete as possible in an attempt to reduce the need to 
consult the various references. To meet this objective, a 
Substantial amount of existing theory is included in this 


presentation. In particular, although sometimes stated in a 
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different form in order to produce a unified approach, the 
algorithms, theorems, and proofs presented herein are not 
unique to the present discussion and can all be found in the 
references listed at the end of this presentation. However, 
because Simply knowing the order of complexity of the 
algorithms is not completely sufficient for performing a 
practical comparison, more precise worst case cost estimates 
are developed in this presentation than are available 
elsewhere in the literature. 

In the abstract presentation of any polynomial or power 
series algorithm in this discussion, all theoretical cost 
estimates are derived in terms of unit operations, or 
coefficient operations. Such operations include the 
addition, subtraction, and multiplication of two elements 
from the coefficient field. It is assumed that the cost of 


each of these three types of operations is the same. 


1.1 Pade Approximants 
A univariate polynomial or power series is defined to 


be an expression of the form 
n 
A(x) = 2 


where n is a specific integer value for polynomials and n is 
o for power series. The a, values, referred to as the 

coefficients of the polynomial or power series, are selected 
from a specific algebraic system. For present purposes, this 


system is assumed to be a field, known as the coefficient 
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field. By convention, it is assumed that ag = Ontor@all 

k > n. The value x is the indeterminate of the polynomial or 
power series and iS a variable whose domain is the same as 
the coefficient field. 

Because it 1S not possible, in practice, to retain an 
infinite number of terms, power series are often truncated 
to a specified number of terms. This makes truncated power 
series look very much like polynomials. A difference does 
Still exist, however, in the way that the division operation 
is defined for these two sets. The details of this 
difference is deferred until Chapter 2, where the general 
concepts of polynomial and power series arithmetic are fully 
described. 

A somewhat non-standard convention is adopted for 
denoting polynomials and power series throughout this 
presentation. If A(x) is a polynomial or truncated power 
Series, then A denotes the vector of coefficients of A(x). 
Since all polynomials and power series considered in this 
discussion are assumed to be univariate, the vector of 
coefficients completely describes the polynomial or power 
series. Therefore, because it results in less cluttered 
notation, the vector form of denoting polynomials and 
truncated power series is used. Unless otherwise indicated, 
the kth coefficient of polynomial or power series A is 
denoted by ay while the indeterminate of all polynomials and 


power series 1S assumed to be x. 
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Two functions are used to indicate the magnitude of a 


polynomial or truncated power series. 


Deranr Crom tal 


Let A be a non-zero polynomial or truncated power 


series. Then 


1) deg(A) = max{k such that a, #03 degree of A, 


2) ord(A) = minvk Such that a, #03 Ordinate of A’. 


By convention, deg(0) is -» while ord(0) is ~. Given this 
definition, a zero-ordinate polynomial or power series is 
one whose ordinate is zero. Some properties of the degree 
and ordinate functions, which are required later, now 


follow: 
Theorem 1.1.2 


Let A,B be two non-zero polynomials or truncated power 
series. Then 

1) deg(At+B) < max{deg(A),deg(B)}, 

2) deg(AB) < deg(A)t+deg(B), 

3) ord(A+B) = min{ford(A),ord(B)}, 


4) ord(AB) = ord(A)+ord(B). 


These properties are a direct consequence of the definitions 
of the two functions and the definitions of the addition and 
multiplication operations for polynomials and power series. 

When dealing only with polynomials, the second inequality in 


this theorem becomes an equality. Note also that fourth 
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property as applied to truncated power series assumes that 
the product of A and B is not zero within the degree of 
Lnuncation, 

When considering the division operation for 
polynomials, an unfamiliar notational convention is adopted 
in this presentation. AS with the integers, the division of 


two polynomials produces both a quotient and a remainder. 
Dern il ion i. Wes 


Let A,B,Q,R be polynomials such that B # 0, A = QBtR, 
and deg(R) < deg(B). Then 
1) |A/B| = Q = quotient of A/B, 


2) A mod B = R = remainder of A/B. 


Although non-standard, this method of denoting the quotient 
of the polynomial division operation is not unique to the 
present discussion as it has been used in such works as Aho, 
Hopcroft, and Ullman [1]. The mod operation should be a 
familiar concept from abstract algebra. 

Since the division of two power series only produces a 
quotient, a Similar method of distinguishing the quotient of 
a power series division is not required. Furthermore, a mod 
operator of the type defined for polynomials does not exist 
for power series. However, a Similar mod operation is 


defined for power series: 
Definition 1.1.4 


Let A be an arbitrary power series. Then 
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Note that the second operand for the mod operator as applied 
to power series is always a power of the indeterminate. 
Thus, the mod operation for power series provides a 
convenient mechanism for indicating the degree of truncation 
for a truncated power series. 

One final algebraic system needs to be defined. A 
rational function is an expression of the form A/B where A, 
the numerator, and B, the denominator, are polynomials. 


Equality for rational functions is defined as follows: 
DEtinition 1.1.0 


Pet.A/B. C/D. be  tWwoerettone! functions, «ben. A/B = C/D 


Mot ADieee BO. 


Note that this definition makes no assertions about the 
equality of the numerators or the denominators. The two 
basic arithmetic operations performed for rational functions 


are given below: 
Pecan tilLOn | scl. 6 


bet A/B, C/D be two rational’ functions. Then 
1) (A/B)+(C/D)” = (AD+BC)7 (BD), 


2 CAT B) C7 DD) =e UAC) 7 ED). 


Finally, a rational, function, .A/B,.is.said)to bes.in reduced 
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With this background information given, it is now 


possible to introduce the concept of a Pade approximant. 


Beeenit LON. lievbyes 


Let A be a power series. Then the rational function 


P/Q is an (m,n) Pade approximant for A if 


12 AQ=Pa= 0 mod pit hale 


lA 


2) deg(P) m, 


IA 


3) deg(Q) ne 


One way of interpreting this definition is that, if a 
Patvonain functton, P/O, 1with ord(O) = 0,5iSeane (m,n) Pade 
approximant for a power series, A, then the application of 
power series division to P and Q produces a power series 
whose first mtn+1 coefficients are the same as the 
corresponding coefficients of A. A rational function is 
referred to as a reduced (m,n) Pade approximant if it is 
equivalent to the reduced form of the rational function 
represented by the (m,n) Pade approximant. Note that by this 
definition, a reduced Pade approximant is not a Pade 
approximant according to the definition given above; it is 
simply the reduced rational form of the given Pade 
approximant. 

The complete set of Pade approximants for a given power 
series are normally organized into a table, called the Pade 
table for that power series, where the (m,n)th entry of the 
table corresponds to the (m,n) Pade approximant for the 


power series. A power series is said to be normal, with 
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respect to its Pade table, if every approximant in the Pade 
table is not equal to any other approximant in the table. A 
power series is (m,n)-normal if its (i,j) Pade approximants 
are distinct from each other for all 0 < i < m and 

O <j <n. A power series is normal along a diagonal or 
anti-diagonal of its Pade table if all approximants along 
that diagonal or anti-diagonal are distinct. 

If a Pade approximant for a power series lies on the 
main diagonal of the Pade table for that power series, then 
that approximant is referred to as a diagonal (n,n) Pade 
approximant for the power series. A handy definition 
associated with diagonal Pade approximants iS given as 


follows: 
Theorem 1.1.8 


Given a diagonal (n,n) Pade approximant for a power 
series, the value of n is defined to be the degree of 


conversion for that Pade approximant. 


This definition provides a useful means of referring to the 

position of a diagonal Pade approximant in the Pade table. 
An interesting result concerning the Pade table for an 

abnormal power series (one which is not normal) is contained 


in the following theorem. 
Theorem 1.1.9 


Let A be a power series. Let P/Q be a reduced rational 


function with deg(P) = m and deg(Q) = n such that 
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AQ-P = 0 mod aren Let k 2 0 be the largest integer 


Ment k+ 


such that AQ-P =a0'mod,x Finally, LetR.= be 


1) 
the (i,j) Pade approximant of A for m S$ i s mtk and 


n < j < ntk. Then Ry = P/O. 


The proof of this result is given by Gragg [12], whose 
article constitutes an excellent discussion of the concept 
of Pade approximants in general. The implication of Theorem 
1.1.9 is that equal Pade approximants for an abnormal power 
series occur in rectangular blocks in the Pade table for 
that power series. 

Another useful feature of Pade approximants is 


presented as the following result. 


Theorem 1.1.10 


m-n 
and 


Let A,B,C be power series such that B = A mod x 
Cit= (a=B) /xn7" for two integers m =n. Now let P/Q be 
the (n,n) Pade approximant for C. Then the (m,n) Pade 


approximant for A is the rational function P/Q where 


m-n mz 


P = BQ+Px and Q = Q. 
Proots: 
AQ-P = AQ-(BOQ+Px™ ") 
= (A-B)Q-Px™ ” 
= (CQ-P)x™ ” 
a OEncde sie ae 


=udpeanodix ys 
Also, deg(P) = deg(BQ+Px” ") 


ie 


< max{deg(BQ) ,deg(P)+m-n} 
< max{m-1,m} 


= m 


and  deg(Q) deg(Q) <n. 
Thus, P/Q is the (m,n) Pade approximant for A. 
Ore. Ds 
This result shows that to compute any Pade approximant on or 
below the main diagonal of the Pade table for a power 
series, it is only necessary to have available a method that 
computes approximants along the main diagonal of a Pade 
table. 
To complete this abstract discussion of Pade 
approximants, one more property of such approximants is 


presented. In order to present this result, the definition 


of Pade approximants is first modified slightly. 
BEL INUE LON shat al 


Let A,B,C be power series such that A = B/C. Then the 
rational function P/Q is an (m,n) Pade approximant for 
AGLE 


1) BO-CP = 0 mod x P 


2yedeg(P) sem; 
3) deg(Q) <n 
Note that if C = 1, then this new definition is the same as 


that given earlier. Using this new definition, the following 
relation between the Pade approximants of two power series 


which are inverses of each other can be given. 
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Theorem 1.1.12 


Let A,B,C be power series such that A = B/C and 
ord(A) = 0. Let P/Q be an (m,n) Pade approximant for 
A. Then Q/P is an (n,m) Pade approximant for power 


series el = C/B, 


The proof of this theorem is trivial, given the extended 
definition of Pade approximants. This result is given using 
the more general definition of Pade approximants because it 
is required in that form later. A benefit of this result is 
that pecoupledrwith Theorem 1.1.10, it indicates that the 
computation of any Pade approximant for a power series 
requires only the availability of a method for computing 


main diagonal Pade approximants. 
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2. Polynomial and Power Series Arithmetic 

Of primary concern in the analysis of any algebraic 
algorithm are the methods used to perform the basic 
arithmetic operations, since these have such a direct effect 
on the complexity of the problem. Two techniques are 
commonly considered when discussing polynomial and power 
series arithmetic: the classical algorithms and the fast 
algorithms. Even though the asymptotic complexity of the 
latter are known to be Superior, both approaches are 
presented in this chapter for reasons which will become 
apparent later. 

Before proceeding, a notational convention adopted in 
this chapter is restated. All polynomials and power series 
are assumed to be univariate with coefficients belonging to 
a field. If A(x) is a polynomial or power series, then A 
denotes the vector of coefficients of A(x) while a, denotes 
the kth coefficient of A(x). On the other hand, if V is a 
vector, then V(x) is the polynomial or power series whose 
coefficients are the elements of V. Taking advantage of this 
duality, only the vector form notation is used for 
polynomials or power series, Since it is less cumbersome in 
appearance and does not result in any ambiguities, while x 
is used to denote the indeterminate of all polynomials and 
power series throughout. 

A couple of definitions are also repeated. Firstly, a 
zero-ordinate power series iS a power Series whose constant 


term is non-zero. Secondly, the costs of polynomial or power 
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series algorithms are determined in terms of unit 
operations, which are defined to be the operations of 
addition, subtraction, or multiplication when performed on 


two elements from the coefficient field. 


2.1 The Classical Algorithms 

The easiest arithmetic operations to perform on 
polynomials and power series are those of addition and 
Subtraction. If A and B are two polynomials, then their sum 


or difference, C = A + B, is calculated as follows: 
Algorithm 2.1.1: Polynomial Addition and Subtraction 


Input: A,B are polynomials. 


Obt putcrmiCar=8A ct SB . 


1) n = max[deg(A)+1,deg(B)+1]. 
2) For i= 0 until n= iedosbeqin 
3) Cis avant. Bo % 


end. 


It can be observed that both polynomial addition and 
subtraction each require n operations involving elements 
from the coefficient field. If A and B are two power series 
instead, then the value of n becomes the number of terms 
retained when using truncated power series and the 
complexity of addition and subtraction is the same. 

Tet products, C ="ABy or two arbitrary polynomials or 


power series, A and B, is defined by the Cauchy product as 


follows: 


Algorithm 2.1.2: Polynomial/Power Series Multiplication 


Input: A,B are polynomials or power series. 


Output: C = AB. 


Pen =-déeg(C ys. 


2) OFOr di t=yO Suntil tn=igdo i begin 


3) c, = 0. 
4) Forija= Osuntirmisdoibegin 
5) ec, Sercent ae hee: 
end. 
end. 


For polynomials, n = deg(A)+deg(B)+1 whereas for power 
series, n is the number of terms retained when using 
truncated power series. The complexity of this algorithm, as 
determined by step 5, is n“+n unit operations. 
For polynomials, the complexity of Algorithm 2.1.2 can 

be improved by using the fact that 

eon = 0 it ane aed CA) “or (=4) > deg(B), 
Eliminating these terms from the previous algorithm, and 


rearranging, produces the following result: 
Algorithm -2.h?3: $Polynomial Murtiplireatiron 


Input: A,B are polynomials. 


Output: C = AB. 
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0 until deg(A)+deg(B) do on Se 
2) For i = 0 until deg(A) do begin 

3) For j = 0 until deg(B) do begin 

4) Crmri-cey. sracbee 


1+) 1+) i 


end. 


The complexity of this algorithm is 2[deg(A)+1][deg(B)+1], 
which is easily verified as an improvement over the previous 
result. A similar improvement is not possible, however, for 
arbitrary power series multiplication because, in general, 
deg(A) = deg(B) = n-1 for any arbitrary truncated power 
series, A and B. 

Up to this point, the arithmetic operations for 
polynomials and power series have been very similar. The 
Same is not true for the operation of division. This is due 
to the fact that the division of two polynomials produces 
both a quotient and a remainder while the division of two 
power series produces only a quotient. For two arbitrary 
polynomials, A and B, the quotient, Q, and the remainder, R, 


are produced as follows: 
Algorithm 2.1.4: Polynomial Division 


Input: A,B are polynomials such that deg(B)<deg(A). 


Output: Q,R such that A = QB+R, deg(R)<deg(B). 


1) R=A, m=deg(A), n=deg(B). 
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$3) For 1 = m=n*®until 0 do begin 


4) qe tc. 
5) ye 0. 
6) For j=) Until ntt=it9do begin 
ih Pei 1 20 Bin eres 
He oT Spy ite Gia ieat 
end. 


Step 2 requires calculating the inverse of a unit element, 
the cost of which is dependent on the actual coefficient 
field used. Step 4 requires 1 operation per execution while 
steps 6 and 7 combined require 2n operations per execution. 
Thusp the total cost tof cpolynomial vdivisiontis 2n(m=nprmtntt 
unit operations plus 1 unit inverse computation. 

On the other hand, the algorithm for dividing an 
arbitrary power series, A, by a zero-ordinate power series, 


B, 18 a variation of the Cauchy rule. 


Algorithm 2.1.5: Power Series Division 


Input: A is an arbitrary power series, 


B iS a zero-ordinate power series. 


Output) O t= aB | mod x” 

1) n = deg(Q)+1 = number of terms required. 
Byres | 

ce sD ae 
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4) q,; = 4;. 


5) For) 4 =) 0 until i-1 do begin 
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In this algorithm, step 2 requires the calculation of a unit 
inverse, steps 5 ane 6 combined require 21 operations per 
execution, and step 7 requires 1 unit operation per 
execution. This brings the total cost of power series 
division to n? unit operations plus 1 unit inverse 
computation. 

A careful examination of the last algorithm shows that 
not every power series can be a divisor in the division 


operation. This is because the constant term, b of the 


0” 
divisor need not be non-zero for every power series. If the 
constant term is zero, then its inverse does not exist so 
that the division algorithm will not work. This situation 
can be rectified to some extent so that division will work 
if the ordinate of the dividend, ord(A), is no less than the 
ordinate of the divisor, ord(B). Since all power series 
divisons required by any later algorithms meet the 
conditions of the above algorithm, this extension is not 
presented. It iS important to remember, however, that this 
restriction does exist. 

Although covered by the previous algorithms, the 
multiplication and division of an arbitrary polynomial or 


power series by a power of its indeterminate should be dealt 


with as separate operations. These operations often do not 
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have any cost attributed to them, regardless of the size of 
the values being operated on, because no actual unit 
arithmetic is required to produce the result. Although the 
computational model permits this assumption, it is not 
completely justified in practice since, at best, some 
manipulation of the coefficients is invariably necessary. 
Consequently, to retain a more accurate account of the cost 
of these operations, a different convention is adopted. 
Given an arbitrary polynomial or power series, A, anda 
power, k, of the indeterminate, the multiplication and 
divison of A by the kth power of the indeterminate have the 


following associated costs: 


Polynomial Cost Power Series Cost 


Multiplication deg(A)+k+1 


Division deg(A)+1 


where n is the number of terms retained for truncated power 
series. The polynomial division cost is the cost of 

producing both the quotient and the remainder. The cost for 
producing only one of the two results is assumed to be the 


Same as the cost of producing both. 


2.2 The Fast Fourier Transform 

A polynomial or power series can be represented by 
either its coefficients, as 1s the most common practice, or 
by a vector of values which result when it is evaluated at a 


specified number of points. The number of points for which 
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this evaluation is performed is at least one more than the 
degree of the polynomial or power series. The discrete fast 
Fourier transform and its inverse provide an efficient 
method for changing from the former representation to the 
latter type and back again. This transformation is discussed 
because it provides the basis for the fast polynomial and 
power series multiplication and division techniques 
presented in following section. 

Part of the Fourier transform's efficiency derives from 
the choice of points for which the evaluation is performed. 
For a polynomial or power series of degree less than n, the 
points used are the nth roots of unity for the coefficient 


field. 
Dewy Matton 2s 2.11 


An element, w, in an arbitrary field is a primitive 
(orSprineipal Mntherosteetauntrty ort 
wo 1 and wth 


foveal lc oe<cti1estne . Taegvalues ws Ot elLacke=-e0 ous ne | 


are referred to simply as nth roots of unity. 


This definition, given by Borodin and Munro [4], differs 
somewhat from that normally encountered in the literature, 
as represented, for example, by Aho, Hopcroft, and Ullman 
[4) or Horowitz and Sahni [14], in that the following 


property is usually included as part of the definition. 
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Let w be a primitive nth 
field. Then 
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FOOGHOL@UnITY Anwantarorerary 


Themut2 =e (z°-1)/(z-1) 


I 


= 0. 


Pr =1141 
[(w") I-49 /(wI-1) 


w-1) 


O7BD: 


The advantage of removing this property from the definition 


is that without it, it iS easier to verify whether or not 


any given element of a field might be a root of unity. 


Now, let A be a polynomial or power series such that 


deg(A) <n, and let w be a primitive nth root of unity for 


the fcoefficient "field. Then 
transform is defined by the 
ha 
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where Vv; is the ith element 
alternate representation of 


algorithm uses a divide and 


the forward discrete Fourier 


equation 


Cea Salty 


of the vector, V, forming an 
A. The fast Fourier transform 


conquer approach to reduce the 


problem of computing these elements to two smaller problems 


of half the complexity. The 


value used for n by this 
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n= 2 where r = [log[deg(A)+1]]. The fast 


form algorithm can then be presented as 


follows: 

ALGorvenm 2sec.0° FeTliA,Vy won 

Input: A iS a polynomial or power series, 
w 1S an nth primitive root of unity, 
rae pllogideg(A)+1]]_ 

Output: V = the Fourier transform vector for A. 

Comment B,C are polynomials or power series, 

S,U are vectors. 

1) tion = 1. then Vo = a else begin 

2 mo =sn/2- 

=) Fon 1) =) 0 suntidim-1 do begin 

a Dee so ae cal 
end. 

o) Call FFT(B,S,w-,m). 

6) Cal FFT(C,U,w*,m). 

1) dus 

8) Fou) = 0) until miedo begin 

9) dt e= du,. 

10) vi-s.4d hy Viipees -d 

11) d = dw 
end. 

end. 
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Let T(n) denote the total cost of this algorithm. Steps 1 to 
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4 are assumed to be cost-free. Steps 5 and 6 each require 
T(n/2) unit operations. Steps 9, 10, and 11 combined require 
4 operations, bringing the cost of the loop starting at step 
8 to 2n unit operations. Therefore, 
T(npeey 2T(n/2y+2n.= 2nlogn+n. 
Veri nication mol sone facternat eA Gorithta2. 2.5 does 
indeed compute the forward Fourier transform requires the 


use of a couple of properties related to roots of unity. 


Theorem 2.2.4 


Let w be a primitive nth root of unity in an arbitrary 


treldagand’ etint=" 2m. Then 


wien = aw 
Proof: 
First note that w? = w°™ = 1, 
Since w" # 1 by definition, this means that wr s -1, 
Therefore, ae = wiwm = vw (=1) = yan 


Theorem 2 .°2.,5 


Het iw bercs primacive ntherooteor unity in anvarbitrary 
field and let n = km. Then wk is a primitive mth root 


of unity for the given field. 


Proof: 
Note that (w 
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Therefore, (w')> = w # 1 since w is a primitive nth 


root Om unity, 
Consequently, wh fits the definition of a primitive 
mth root of “unity ; 
©. Be). 
The validity of the fast Fourier transform algorithm can now 
be verified using mathematical induction. First note that 
the recursive calls to produce vectors § and U rely on we 
Beipceacprimitivesin/2) th rooteomsunhty.,eThissicy cin fact, 
guaranteed by the last theorem. Now assume that the 
algorithm works for n/2. In particular, this assumption is 
equivalent to accepting S and U as the true forward 


transforms for B and C, respectively. The inductive step of 


the verification then goes as follows: 
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Therefore, since all elements of V are covered by these two 
cases, and since they show that, given the initial 
assumption, the equations used by the algorithm are the same 
as computing the defining equations, the fast Fourier 
transform algorithm can be considered valid by the principle 
of mathematical induction. 

As it turns out, the reverse (or inverse) Fourier 
transform, which computes the coefficients of a polynomial 
Or power series given its values at the nth roots of unity, 
is just a slight variation of the forward transform. Let A, 
w, and V be the same as defined for the forward transform. 
Then the coefficients of A are computed from V using the 
equation 


1 : Neal 


ee) ee ae ees 
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k=0 
The rightmost side of this equality suggests that the 
reverse transform can be computed by making direct use of 
the fast forward transform algorithm with wae replacing w as 


the primitive nth root of unity. This, of course, requires 


that w | is also a primitive nth root of unity. 


\ ei 
’ 
A 
Irigmiets: Lit 
i A _ ton! wood? ead saths 
” is 
y -_ 7 Pa a. 
i : ‘ a. ea = a y wt ae 
SiS “wr im2y enepysipe end \ fo qin 
: | ; 
°» - > ia = 
: ‘ PNT sis ota SGN > 2 
7 7 Gu 
% : ; 
4 ba 
ensi7 OTSeTy 
= ms va c 
om ’ | é 
r : ; 
2 
13 seas YE pite 
i 4 ; } oS > ot 
“it 
4 or 4 i .@ a9 
; tr } 4 
_* i { 4 
‘oe AS 


27 
Theorem 2.2.6 


Let w be a primitive nth root of unity in an arbitrary 
field. Then wi! is also a primitive nth root of unity 


for the same field. 


Proof: 
Birst observe that (wo JS te) ) =i) 1] 1. 
Now Let) i Sen-= n=". 
Then, since wi ee ealnge = Cm ee 
Consequently, eel sea primitive nth root of unity. 
Oe De 
Therefore, let the values of w and n be the same as for the 


fast Fourier transform algorithm. Then the reverse transform 


can be calculated using the algorithm given below. 
ALgouwithme2}2a7: rrr | (A,v,w,n) 


Input: V is a vector of values, 


Wi tS.an- Nth primitive coot of unity, 


1G 


Drees ae cert Or - Some wl 


Ourputs A the polynomial or power series, deg(A)<n, 


whose Fourier transform is V. 


iMekor{#i= Osuntibansbido begin 
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3) Call FFT(A,V,w. 
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5) aFOr im-RBO Runt LY n-1 do begin 
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end. 


Theaverviicartvonmof this algorithm is trivial, so the 
details are omitted. As to the algorithm's complexity, the 
loop starting as step 1 is cost-free, step 3 has a cost of 
2nlogn+n operations, and the loop starting at step 5 has a 
cost of n operations. The single operation at step 4 is 
ignored. This gives the reverse transform algorithm a total 
cost of 2nlognt+2n unit operations. | 

The algorithms just presented for computing both the 
forward and the reverse Fourier transforms rely on the use 
of recurSive procedure calls to produce some of their 
efficiency. While such calls do not affect the theoretical 
complexity of an algorithm, they do have an associated 
practical overhead which makes recursive algorithms less 
desirable than iterative algorithms of the same cost. (In 
fact, this overhead is charged on all procedure calls since 
most programming language implementations do not distinguish 
between recursive and non-recursive procedures.) For this 
reason, an iterative version of the fast Fourier transform 
algorithm, based on an evaluation technique developed by 


Fiduccia [9], is given below. 
Algorithm 2..2.8:,0teravive. PRT CA, Vjwiyn) 


Input: A is a polynomial or power series, 


w is@ansnth primitive root of unity, 
>[logldeg(A)+1]] | 
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Output: V = the Fourier transform vector for A. 
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FOr pe=s0sdincir., n= 1 .do begin 


end. 
m=n/2, j=0. 
Borei t= Osuntilin-2 “dotbegin 


If i < j then begin 


nee 
a Aten 
i J 
Vien 6 
J 
end 
ie EF Hi 


while k < j+1 do begin 


ese Ks 
KeaOk 2 
end. 
j= jtk. 


end. 
Bor k= 2estep, kK until n do begin 
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end. 


end. 


Verification of the validity of this form of the algorithm, 
and of the fact that it has the same complexity as the 
recursive form, can be found in Horowitz and Sahni [14], 
from whom this particular version was obtained, or in Aho, 
Hopcroft, and Ullman [1], who present a similar version of 


the iterative approach. 


2.3 Fast Multiplication and Division 

Fast polynomial and power series multiplication use the 
fast Fourier transform to convert the problem to another 
less costly one. In computing the product, C = AB, for two 
arbitrary polynomials or power series, A and B, this is 


accomplished as follows: 


Algorithm 2.3.1: Fast Polynomial/Power Series 


Multiplication 


Input: A,B are polynomials or power series. 


OULDUE. Ce = AG, 
Comment U,Vrare vectors: 


1) cv = flogldegtA)+deg(B)ta]]|% 


2) ove 


, m=deg(C)+1. 
S))¥ = "primitive mth root ofp unity. 
4) Call FFT(A,U,w,n). 
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6) For 1 =~,0Guntil»n-1 do begin 
7) Uae gUly 

end. 
8) Call FFT '(C,U,w,n). 
9) C = C mod x". 
For polynomials, deg(C) = deg(A)+deg(B) while for power 
Series, deg(C) is one less than the number of terms retained 
for truncated power series. It should be noted, therefore, 
that the last step of the algorithm is only required for 
power series multiplication. This algorithm is directly 
based on the following property of the Fourier transform 


when applied to polynomials. 
Theorem 2.3.2 


Let A,B be arbitrary polynomials and let F(-) denote 
the Fourier transform operation. Then 


F(AB) = F(A)F(B). 


Let C = AB, m = deg(A), and n 2 deg(A)+deg(B). 
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1 hee 
i Ce. aD (n+i-j) mod n? * ae abi 4? 
Dz 
7 ey oa inka 4) mod n° 


Now let El gids denote the kth element of the Fourier 
transform vector. 
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Turning, tosthercost of this algorithm, steps 1 to 3. are 
ignored, steps 4 and 5 each require 2nlogntn operations (for 
a total of 4nlogn+2n operations), the loop starting at step 


6 takes n operations, and step 8 has a cost of 2nlognt+t2n 
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operations. This brings the complete cost of fast polynomial 
multiplication to 6nlogn+5n unit operations. The cost for 
fast power series multiplication has an additional n 
Operations associated with it as a result of step 9, 
Bbranging its total cost to 6nlognt+6n unit operations. 

A comparison of the fast multiplication costs with 
those for classical multiplication indicates that the fast 
multiplication method is not immediately superior. Assuming 
the worst case Situation for classical multiplication, which 
occurs when deg(A) = deg(B), it can be observed that the 
crossover for polynomial multiplication occurs when 

2[deg(A)+1][deg(B)+1]-6nlogn-5n 2 0. 
This happens somewhere around deg(A) = deg(B) = 55. The 
crossover for power series is a little higher. If m is the 
number of terms retained for power series, then the two 
techniques crossover when 

m+m-6nlogn-6n = 0 

which happens when m = 78. The crossover point presented 
here for polynomials is a little higher than that produced 
by Moenck [23]. This discrepency is due to a slightly 
different approach to accumulating the number of unit 
operations. In particular, Moenck does not include the 
operation in step 11 of Algorithm 2.2.3 (the fast Fourier 
transform) in his counts. 

Fast division of both polynomials and power series rely 
heavily on the ability to perform power series inversion 


efficiently. A fast method for determining the inverse of a 
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power series can be obtained through the use of Newton's 
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approximation formula. Let A be a zero-ordinate power series 


and let n be the number of terms desired in the result. Then 


ie can be computed by using: 


Algorithm 2.3.3: Fast Power Series Inversion 


PSINV(Q,A,n) 


Input: A iS a zero-ordinate power series, 

n = number of terms desired in result. 
Output: Q = al mod x", 

el 
Ted Qo=ay m=[logn]. 
2)SFOr -i,=, 1 until m dosbegin 

Z 2! 
3) QO. = (2Q,_,-AQs_,) mod x 
end. 


4) Q = Q, mod ae 


This algorithm is based on the idea that each successive 


| than the 


value for Q. is a closer approximation to A™ 
previous one. That this is indeed the case is verified as 


follows: 


Let e a '-9, Pot ta biei0 {st ose. 


Ee ay 
Then C4 = A Qn 44 
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dov=ih 3°52 2 
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a '-(a7 '-ae?) mod x 


k+1 
= Aer mod x? : 


Thus, it can be shown by induction that 


ord ( Su 


IV 


Cpay) 2 2 
Therefore, ord(e ) Pipe koa 


Consequently, AQ = A(A” '~e, ) = i-Ae. = 1 mod x". 


The last equation shows that the value of Q 1S accurate up 
to the degree of truncation for truncated power series. 

In order to determine the cost of the inversion 
i 


and let N = 2™, Furthermore, assume 


algorithm, let K-=.2 
that fast power series multiplication is being used. Then, 
using the observation that deg(Q,) = K-1, step 1 requires a 
Single unit inverse calculation, step 3 requires 18KlogK+31K 
operations, and step 4 requires N operations. The cost of 
step 3 is based on the fact that a multiplication of two 
truncated power series of degree K/2-1, each, anda 
multiplication of two truncated power series of degree K-1, 
each, is required. Therefore, the total cost of inversion is 
bounded by 36NlogN+27N unit operations plus 1 unit inverse 
computation. 

Given Algorithm 2.3.3, it becomes a simple task to 
develop a fast power series division algorithm. Given an 


arbitrary power series, A, and a zero-ordinate power series, 
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terms, by: 
Algorithm 2.3.4: Fast Power Series Division 


Input: A 1S an arbitrary power series, 
B is a zero-ordinate power series. 


Output: O&= ane mod x", 


PmCall PSINVIO, Bony 
2) Q = AQ mod x”. 
THESPLrOOE OCfBthis»algorithm ts trivial. As to its cost, let 
N = gliogn] | Then step 1 requires 36NlogN+27N operations 
plus 1 unit inverse calculation and step 2 requires 
6NlogN+6N operations, giving fast power series division a 
total cost of 42NlogN+33N unit operations plus 1 unit 
inverse calculation. As with multiplication, the fast method 
is not immediately superior. This time, the crossover occurs 
when n*-42N1logN-33N 2 0, which happens at n = 682. 

Fast division of two polynomials is a little more 
complex. If A and B are two arbitrary polynomials, then the 
following algorithm produces the quotient, Q, and remainder, 


R, which result from the division of A by B. 
Algorithm 2.3.5: Fast Polynomial Division 


Input: A,B are polynomials such that deg(B)<deg(A). 


Output: Q,R such that A = QB+R, deg(R)<deg(B). 


Comment C,D,E are polynomials. 


a 


’ 
arte pi 


o Vth ‘Oo Ate 


; E ets t rod 
el tS Aegt 
ot lalusies eaeTSevnl “ine 
oo 
v4 ite : Ton a 29e6 “aslo Me 
oa an 
4 ‘the .2 hi ‘fa ai2 
~~ We 2 
q 4 ¢ 
en ume bg ow Io not peed 
; ' 170 
FORTIES I! 16 is A Sak # EE eae 
,; So alan 
t } =| if poh (398.4 
a # : ~iJ i b 
ra iz 1.2 vf si 
; oa 
i ri en aug tt *] 
y 
ay) 


37 


=k 

~~ 

wa 
i 


[log[deg(A)-deg(B)+1]]. 


3) For i-=\0 until deg(B) do begin 


4) Fe och -xa Sy aay 


end. 
5) ecallnpsinvipscen): 
6)cRoreb,=s0euntilenairdo begin 


74 ci c daSeD; : 


end. 


8) tOe=t pan/x259) PAPEL pe 


9) R = A-QB. 


The correctness of this algorithm is proven as follows: 
DC = 1 mod x" = Fx™+1 where deg(F) < deg(B). 


geq( Binns 


Then EB = x G where deg(G) < deg(F). 


Thus R = A - QB 
deol b) eed | 


A - | (ABB) /x 
A - bia xS6o Sapa ey7e9Ssi8)*aa1) 
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Now note that deg(A) < deg(B)+n-1 and 
deg(AG) < deg(A)+deg(B). 
Then deg(R) < deg(A)-n+1 < deg(B)+n-1-n+1 = deg(B). 


Consequently, A = QB+R where deg(R) < deg(B). 


This shows that the values computed for Q and R are the 
appropriate quotient and remainder for polynomial division. 
As an aside, it should be pointed out that the operations 


performed in steps 3 to 7 are, in fact, just a different 
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form of the polynomial reciprocal procedure given by Aho, 
Hopcroft, and Ullman [1]. 

Now let N, = 2 logldeg(A)+n]] and N, = gltogideg(A)]] | 
Assume that fast polynomial multiplication is being used. 
Then step 5 of the fast polynomial division algorithm has a 
cost of 42nlogn+33n operations plus 1 unit inverse 
computation, step 8 requires 6N,1logN,+5N,+deg(A) tn 
operations, and step 9 requires 6N. 1logN.+5N.t+deg(A) 
operations. The remaining stepsS are assumed to be cost-free. 
Thus the total cost of fast polynomial division becomes 
42nlogn+34n+6N,1logN,+5N,+6N.1ogN.+5N.+2deg(A) unit 
operations plus 1 unit inverse calculation. For determining 
the crossover point for classical and fast polynomial 
division, assume that deg(A) = 2deg(B). This eae oailar 
choice is made because it represents the worst case for the 
polynomial divisions encountered in the Pade algorithms of 
the next chapter. The cost of fast division under these 
conditions collapses to 72nlogn+89n+4deg(B) unit operations 
plus a single unit inverse computation, which becomes 
cheaper than classical division at the point where 

2[deg(B) ]*-deg(B)-72nlogn-89n =e04 


This happens when deg(B) = 951. 
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3. Algorithms for Computing Pade Approximants 
Having provided some background information for 
- handling polynomials and power series via computer, it is 
now possible to turn to the discussion of algorithms for 
computing diagonal Pade approximants for a power series. The 
four algorithms considered in this chapter fall into two 
categories: one which computes approximants along the 
anti-diagonals of the Pade table, and three which compute 
them along the forward diagonals. 

As with the previous chapter, the coefficient vector 
form of denoting univariate polynomials and power series is 
used, ainn dx being used as the indeterminate. It should also 
be remembered that a unit operation in a polynomial or power 
series algorithm refers to the operations of addition, 
Subtraction, and multiplication when performed on two 
elements from the coefficient field. 

A few further definitions are also required by the 
discussion in this chapter. A power series is said to be 
normal along a diagonal or anti-diagonal of its Pade table 
if all approximants along that diagonal or anti-diagonal are 
distinct. A power series is (m,n)-normal if all (i,j) Pade 
@ooraxrimancts areedrstinct fors0se) 17s m and. 0 < <j. son, “A 
polynomial sequence is normal if the degree of each 
Successive polynomial in the Sequence is exactly one less 
than the degree of the previous one. 

Because two forms of polynomial or power series 


arithmetic can be used for the various algorithms in this 
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chapter, one final convention is adopted when determining 
the initial cost estimates of each algorithm. The function 
C,(m,n) is used to denote the cost of multiplying an 
m-degree polynomial or power series by an n-degree 
polynomial or power series while the function Cp (m,n) 
denotes the cost of dividing an m-degree polynomial or power 
series by a polynomial or power series of degree n. A 
Similar function is not required for addition and 
Subtraction since only one form of these operations exist. 
The distinction between polynomials and power series need 


not be made in the notation since the context of each 


algorithm indicates which entities are being operated on. 


3.1 The AD Algorithm 

The anti-diagonal approach to computing Pade 
approximants is based on some observations made by McEliece 
and Shearer [20] concerning the extended Euclidean algorithm 
for computing polynomial GCDs. Let Ag and A, be two 
polynomials such that deg(A,) < deg(A,). Then the extended 


Euclidean algorithm is defined as follows: 
Algorithm 3.1.1: Extended Euclidean Algorithm 


Input: Aj),A, are polynomials such that deg(A,)<deg(A)). 


Output: A,,S,,T, such that S,A,+T,A, = A, = GCaCAs, A). 


Comment A.,S:,T;,Q; are polynomials for all 1. 
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4] 


2) While A,_, mod A, # 0 do begin 


1 


3) Q.=[A,_,/A:], A,,,=A;_, mod A,. 
#) Seto weer peta ies 
5) 1 2044. 

end. 


The sequence {As} for ‘0 Ss i = kis referred to as the 
polynomial remainder sequence for Ag and A,. This algorithm 


exhibits three very valuable properties: 
Theorem 3.1.2 


Let Aj,A, be two polynomials with deg(A,) < deg(A,). 


Then 
1) S;Aj+T,A, = A; Or. SeHG= Kk 
2) deg(T,;)+deg(A,_,) = deg(A,) { astcies k 
% Aeeeryt 
3) TaA ge aoe eek. 8290-1) Ay 1 Sant ask 


fdr jall A,,S.,T; computed by the extended Euclidean 


algorithm. 


Each of these three properties can be proven using 
mathematical induction, the details of which are omitted. 
At this point, an interesting result, as presented by 


McEliece and Shearer [20], can be given. 
Theorem 3.1.3 


Let Aj,A, be two polynomials with deg(A,) < deg(A,). 


Let m,n be two integers such that m 2 deglgcd(A),A,)] 


and mtn = deg(A,)-1. Then there exist unique A, ,T, 
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computed by the extended Euclidean algorithm such that 
deg(A, ) & Mm; deg(T, ) = 2 


fonesome Cas teseck: 


Proof: 
Let i be the index such that deg(A, ) < m and 


deg(A,_,) 2 m+1. 


| 
Then deg(T, ) = deg(A,)-deg(A,_,) < m+n+i-m-1 = n, 


and deg(T. 


cra = deg(A,)-deg(A, ) Sondtneiem Sunt: 


Thus, A. and T; exist and are unique. 
OVE. 
This theorem shows that the extended Euclidean algorithm 
provides a mechanism for computing Pade approximants for a 


: +nt+ 
power series. Let Ag axe oe 


and let A, be a power series 
truncated to mtnt+1 terms, for two integers m and n. Applying 
the extended Euclidean algorithm to Ay and A, until 

deg(A, ) < m produces the polynomials A,, S,, and T, such 


that 


S;Ap + T,A, = A; => TjA, = A; mod A 
A 


where deg (A, ) <= m and deg(T, ) S'heelnis inareates that A/T; 


is the (m,n) Pade approximant for power series A It is a 


1 
trivial matter, therefore, to verify that for any arbitrary 
power series, B, and integer, d, the extended Euclidean 
algorithm can be used to obtain all the Pade approximants 


for B along the anti-diagonal in the Pade table whose 


indices, m and n, satisfy the equation mtn = d-1. 
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Having developed this technique for computing Pade 
approximants, the problem now arises as to determining the 
best method for computing the polynomials A., Ss, and T;, 
defined by the extended Euclidean algorithm. The most 
efficient method, in terms of asymptotic cost, for producing 
these polynomials is a GCD algorithm first presented (in its 
most general form) by Moenck [21,22]. To facilitate the 
presentation of this algorithm, the following definition is 


made. 
Defanitteont 33424 


Let iQ; 3 be the sequence of quotients computed by the 
extended Euclidean algorithm for 0 < i < k. Then the 


matrices M. ; are defined as follows: 


Lf 


t0 
1) OM. the= OusnTeEs ky 


0. QO 1 
ne te Oo<e3 sate< kk. 
rJ 1 Oh, 1 TOs 


To make effective use of these matrices, the following 
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properties are also required. 


Theorem 3. 1.5 
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A; Le 
2) = Mr .| 2 Oo ears eek 
he roid Pe 
iva J+ 1 
S; T, 
3) Meo = : ORS osork 
i+] “i+ 


where A;,S;,T; are as defined by the extended 


Euclidean algorithm. 


The first property is a direct consequence of the definition 
while the other two properties can be proven by trivial 
inductive arguments. 

The efficient GCD algorithm mentioned above actually 
consists of two separate algorithms: one which computes the 
two polynomials which lie at the exact middle of the 
remainder sequence produced by the extended Euclidean 
algorithm, and another which uses the first one to compute 
the rest of the polynomials in the remainder sequence. Only 
the first one is presented here since it is shown later that 
the other is not required for computing diagonal Pade 
approximants for a power series. Let Ay and A, be two 
polynomials such that deg(A,) < deg(A,). Then the EMGCD 


algorithm, given below, computes the matrix 


M = 
So S44 To 


where deg(A,,,) < [deg(A,)/2] < deg(A,) and A,, A S 


og fen 


Siaqe Ts, and T,,, are as defined by the extended Euclidean 


algorithm. 
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Algorithm Slelie: EMGCD(M,A),A,) 


Input: Aj),A, are polynomials such that deg(A,)<deg(A,). 


GUEDUEs aM = 


sate he Say Tp 
where S.AjtT.A, = A., Ss 44497 Toe ag Asay? 


deg(A,,,) < [deg(Ag)/2] < deg(a;). 


Comment By B, CoC, /D,E,F,O,G, 1G,,H),H, are 
polynomials, 
M)7Mj MoM; are matrices. 
Ay 1 0 
pe boa 0 cG deg(A,) < deg(A,)/2 then M = 
Aes One| 
else begin 
a m=[deg(A,)/2]. 
m m 
3) By=LAg/x |, Co=Ag mod x . 
4) B ,= LA, ane C,=A, mod x” 
5) Call EMGCD (M, 1By,B,). 
want 
6 ) Mae 
OF 041 
D 
= m il 
7) | | = M,[x Cy, C,] 
E 
D 
8 ) If deg(E) < deg(A))/2 then M = , M, 
E 
else begin 
9) Q=|D/E|, F=D mod E, k=2m-deg(E). 
k k 
10) Goy= LE/x 1. H=E mods x. 


ii) G.=|F/x"|, H,=F mod a 
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12°) Call EMGCD(M., 1GyG,). 
eho tayay. 
13) M. | ; 1 
14) Mee | M, [x Hy H,]* ; 2), be | 
oe’: 
end. 
end. 


This version of the partial GCD algorithm corresponds to 
that given by Brent, Gustavson, and Yun [5]. 
In order to prove the validity of the EMGCD algorithm, 


a couple of preliminary results must first be given. 
Theorem 3.1.7 


Let A,B be two polynomials such that deg(B) < deg(A). 


Let A, rAnBa, B. be polynomials such that A = A,xS#AS, 


Bees ace where k < 2deg(B)-deg(A), deg(A,) < k, 


1 2 
deg(B,) < k. Finally, let Q = [ACE yes bay Bly 


R= A-OB, Ri = A/ 70,8) then 
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Proof 
Let: Re = A-Q,B 
k k 
= (A, x +A, ) -Q,(B, x +B.) 
= (A, -Q,8, )x Ke (A. -Q,B, ) 


e as 
R,x +A,-Q,B, 


Now, deg(R,x*) = deg (R,)+deg(x*) 
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< deg(B,)+k 
= deg(B), 


and deg(Q,B.) = deg(Q,)+deg(B,) 


Thus, deg(R') 


< deg(A,)-deg(B,)+k 
= deg(A)-deg(B)t+k 
< deg(A)-deg(B)+2deg(B)-deg(A) 
= deg(B). 
< max[deg(R,x*) ,deg(A,),deg(Q,B,) ] 
< max[deg(B),k,deg(B) ] 


=edeg(B) . 


jhere tore Ae) 0,3) tR where deg(R') < deg(B). 


Consequently, Q = Q, by uniqueness of quotients. 


Now, for convenience, let n = deg(A)-deg(B). 


Then |R/x 


Thus, 


fab) 


= [R'/xD**) 
Beas 


[(R,x*+A,-0,B,)/x 
[Ryvx | 
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The importance of this theorem, other than displaying an 


interesting property of the quotients of polynomials, is 


that it is required for proving the following result. 


Theorem 3.1.8 
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wnheresdeg(b-)y= [[deg(A,) =k}/2| <= deg(B ,). Then 


bel ee 


where deg(A,,,) < [[deg(A 9) tk1/2] < deg(A,). 


Proof: 
The proof of this theorem is by induction. 
* 
deg(Aj), aig res deg(A._4), 
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For convenience, let n 


n' = deg(A pei deg(A.), r= deg(B. 
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Venn! 


Now suppose that 2degqtBs)-degiB. | 


Then, by substitution, 
2deg(B.)-deg(A._,)+k < deg(A,)-deg(A._,) 
=> deg(B.) < [[deg(A,)-k]/2]. 
Also, by adding k to each side, 
2deq(B,)-deg(B._,) +k <n 7k 


=> 2deg(A.)-deg(A;_,) < deg(Ay)-degla,_, 


=> deg(A.) < [[deg(Ay)+k]/2]. 


)+k 


Thus, in this case, the theorem holds with B. =n 


a fect he 
Ces a i Bay Coane eke Mey eS a edel ee g 
forepotn BB, and AgrA, as desired. 
Therefore, assume n-n' < 2deq(B.)-deg(B,_,). 
Then [[deg(Ap p—k 1/2) = heed ( B. ee 
Also, n-niak = sachts,)-Aéain tisk. 
= 2deg(A;)-deg(A._1), 
which means that [ [deg (A,)+k]/2] < deg(A.). 
Thus the third assumption holds by induction. 
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Also, again by Theorem 3.1.7, 
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Consequently, the first assumption also holds by 
induction and the theorem is valid. 
OA Tr By 
It is the result in this theorem that forms the basis of the 
EMGCD algorithm. 
The proof of the EMGCD algorithm's correctness can now 


be given, using induction, as follows: 
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A, and A, such that deg(A.,,) < [3deg(A,)/4] < deg(A.) 


and deg(A,,,) < [deg (Ap a < deg(A,). 


Assume that the recursive procedure calls produce the 


proper result. 


First, it 26° shown that M, = M. 0 


By the inductive hypothesis, 
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Therefore, by induction, the algorithm produces the expected 
result. 

For the purpose of determining the complexity of this 
algorithm, a number of assumptions are made. First, it is 
assumed that deg(A,) = deg(A,)-1 and that the remainder 
sequence for the input polynomials is normal. This 
represents the worst: case (highest cost) situation for this 
algorithm. Secondly, it is assumed that n = deg(A,) is an 
exact power of 2. This is done for notational convenience 
mostly, and does not detract from the complexity equation 
which results, since the same convention is adopted for all 
Pade algorithms discussed later. The major consequence of 
this last assumption is that [n/27] = [n/2?| = n/2? for all 
2 <n, which is valuable in reducing the number of terms in 
the final cost estimate. 

Based on these assumptions, the following statements 


can be made concerning the sizes of various intermediate 


results: 
deg(By) = p72 deg(Cy) = n/2 
deg(B,) = (ny tea deg(C,) = n/2 
deg(D) = (3n/4)+1 deg(E) = 3n/4 
deg(Q) = 1 deg(F) = (3n/4)-1 
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deg(G)) = CTARY <2 deg (Hy) =n an 4 
deg(G,) = (n/2)-3 deg(H,) = On Al 1 
deg(M,) = n/4 SHG ete 40a Ae) | 


It should be noted that the degrees of matrices M, and M. 


| 
are assumed to be the same as the degrees of the largest 
element in each matrix. Under these conditions, the cost of 


each step in the EMGCD algorithm can be outlined as given 


below: 


Si @ ars came ft 

Step 4: nti 

Step 53) TCin/2)—1) 

Step 7: 4c, (n/4,n/2)+(9n/2)+7 

Step 9: C,((3n/41+1,3n/4) 

Step 10: (3n/4)+1 

Step 11: 3n/4 

Step 12: T([n/2]-3) 

Step 14: ac, ((n/2)-1, [n/4)+1)+2C,,(1,{n/4]-1)+ 


8C,,(n/4,n/4)+(11n/2)+13 


The remaining steps have no associated cost. After a few 
minor modifications, this brings the total cost of the EMGCD 


algorithm to: 


Gn = 2T(n/2)+C,([3n/4]+1,3n/4)+4C,,(n/4,n/2)+ 


12C,,(n/4,n/4)+2C,,(1,0n/4]-1)+(27n/2)+24 


Using classical polynomial arithmetic, this cost becomes: 
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oT (n/2)+(5n“/2)+(73n/2) +58 


ry 
2) 


5n“+(73/2)nlogn+53n-58 unit operations, 
while with fast polynomial arithmetic, this cost is: 


Tien) 


2T(n/2)+114nlogn+104n+178 


57nlog’n+16 inlogn+178n-178 unit operations. 


Note that this cost is based on the size of the smaller 
input polynomial. This is not a normal practice. However, 
when using this algorithm to compute Pade approximants, as 
outlined shortly, the size of the larger input polynomial is 
always exactly one more than the size of the smaller one. 
Therefore’, in thisecase, thisrconventionsisevalid. It is 
done because it simplifies the cost comparisons of the 
anti-diagonal approach for computing Pade approximants with 
other methods. 

Having completed the description of the EMGCD 
algorithm, it can now be used to compute the reduced (n,n) 
diagonal Pade approximant for an arbitrary power series, A, 


in the following way: 
Algoraithm) 3 Wot ADCP) OpAgn ) 


Input: A 1S a power series, 
n is an integer. 
Output: P,Q such that P/Q is the reduced (n,n) diagonal 


Pade approximant for A. 


Comment Ma'S ‘acmatrix. 
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The correctness of the first step of this algorithm has 
already been verified. Step 2 is simply an exotic, but 
mathematically concise, way of setting the values of P and Q 
to be the same as specific elements of matrix M. The purpose 
of the last two steps is to put the rational function P/Q 
into reduced form. That this particular method for doing so 


is correct is quaranteed by the following result: 
Theorem 3. 1.-1,0 


Let A,,T, be as defined by the extended Euclidean 


algor nthmaio yal] “Ow ii imaks Then gcd(A, ,T,) | Ay: 


This result is a direct consequence of the third property 
given in Theorem 3.1.2. The implication of this property to 
the anti-diagonal approach for computing the (m,n) Pade 
approximant is that any common divisor of any numerator and 
denominator derived using the extended Euclidean algorithm 


Rone mn | 
must also divide x 3 


Therefore, any common divisor of 
such a numerator and denominator, and, thus, of P and Q in 
Algorithm 3.1.9, must be an integral power of the 


indeterminate, x. 
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Determining the cost of Algorithm 3.1.9 is 
Straightforward. The cost of step 1 is the same as the cost 
of the EMGCD algorithm when deg(A,) = 2n while the cost of 
step 4 is 2n unit operations. Therefore, using classical 
arithmetic, the cost of the anti-diagonal approach to 


computing the reduced (n,n) diagonal Pade approximant is: 
Pon) = 20n°+73nlogn+181n-58 unit operations, 
whereas using fast polynomial arithmetic, this cost is: 


T(n) = 114nlog’n+550nlogn+794n-178 unit operations. 


3.2 The MD Algorithm 

The first of the forward diagonal algorithms is the MD 
algorithm presented by Brent, Gustavson, and Yun [5]. This 
algorithm uses another recursive algorithm to compute the 
(n,n) diagonal Pade approximant for an (n,n)-normal power 
Series quotient. Let A and B be two power series such that 
ord(A) = ord(S) = 0 and A/B is (n,n)-normal. Let s be an 
integer with a value of zero or one. Then the reduced 
(n+s-1,n) and (n+s,n) Pade approximants for A/B can be 


computed using the following algorithm: 
PeCorLehms. 2s eMD2IM A, Bens) 


Input: A,B are polynomials such that ord(A)=ord(B)=0 and 
A/B 1s (n,n)-normal, 
n is an integer, 


s is an integer such that s=0 or s=1. 
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| where P/Q is the reduced (n+s-1,n) Pade 
approximant for A/B, 
and P/Q is the reduced (nt+s,n) Pade 


approximant for A/B. 


Comment C,D,E,F are polynomials, 


M,M, are matrices. 
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1) itnt+s*s 0 athen M = | q else begin 
1b 
0 
2) k = |[n/2]. 
oy, Ges inak, 
4) i Et GENS = 5 
5) Call MD2(M,,B mod rn mod et eae 
CG -A 
6) = M, é 
D B 
= B=|C/x*9| mod ne F=(D/x™tot! mod ee 
8 ) Cala MD2(M.,F,E,k,0). 


QQ -x 0 
9) = M M.. 
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10) M = 


end. 


The proof of this algorithm's validity is, again, by 


Tnavuction. 


Letor. see denote the (i,j) Pade approximant of a 


power series. 
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Furthermore, ord(P,) = ord(P, ) = Ord (Ow) a= ord(Q, ) = 0 
for =ia15 25 
Because of normality, ord(E) = ord(F) = 0. 


Thus, PB-QA = (RePa-eP O5)B=(O9P,nxQnOs) A 


(P,B-Q,A)P.-x(P,B-Q,A)Q. 
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Also, deg(P) max[deg(P,)+deg(P.) ,deg(P,)+deg(Q,)+1] 
=9n+s-ih, 
deg(Q) = max[deg(Q,)+deg(P,) ,deg(Q,)+deg(Q,)+1] 


= Nn, 


ord(P) = OndME Pe) = 0 since ord(xP,Q.) 


| 
— 
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ord(Q) = cudi Osby = 0 since ord(xQ,Q.) 

Therefore, P/Q is the reduced (n+s-1,n) Pade 
approximant for A/B. 

By an identical argument, P/Q is the reduced (n+s,n) 


Pade approximant for A/B. 


Thus, the algorithm behaves as stated. 
For determining the cost of the MD2 algorithm, a few 


conventions are again adopted. In particular, it is again 


gé 
; ar 
) “S48 
( \Al, 
of 
i he 
\ 
. ae 
shytae 367 _ 
= 7 _ 
fi «Q Seo. G9) = 
— 8 
ry it i} = 
‘ ace 2 - 
5 i 
i : #1 
rid 
4 
* ° ° ’ ~) a | ( 
ao 
” ’ ‘ 
£/2Ta 3 =\ (9. Ga bs 
' VE } 29:0) \ 
net i 
a shesiin &. ae hed oo Lh ae 
2, (o,f Zen ‘bSoumrsa ody er O15 , extern 
'@Se ae 
‘o- ; 
Bs . enrk wo Raye 
_ > ued 
. ) Baws : eho e: O49 ,tfemepts les ihzaebhl ae 
. | oo 
7-7. jay? Ing rj} ogame 2 a | 
.Beies72 25 advetied a3 hope: 
: a 
3 ,meads’ et ests on «sts pnicias 
7 i - oi 
ay : fhe cae ahi 
ey €s , fal 7’ 7) “ D6 MLeEpe e 


59 


assumed that n 1S an exact power of 2. Furthermore, it is 


assumed that s 1. The effect of these two assumptions is 
that k = d = m = n/2. Under these conditions, and 
remembering that A/B is (n,n)-normal, the sizes of various 


intermediate results can be given as follows: 


deg(A) = 2nt+1 deg(C) = [5n/2]+1 
deq(B) ==92ni4 deg(D) = [5n/2]+1 
deg(M,) Sry deg(M.) Een) 2 


In fact, matrices M, and M, each contain one polynomial of 


1 Z 

size [n/2]-1 and three of size n/2. However, setting the 
Size of all to n/2 should only have a very minor effect, for 
large n, on the final equation. Based on these values, the 


cost of each step in the algorithm can now be summarized as 


below: 


Step 52) any) 

Step 6: 4C,,(n/2,2n+1)+5n+4 

Step 7/3) Sone? 

Step 8: T(n72) 

Step 9: 4C,,(n/2,[n/2]+1)+4C,,(n/2,n/2)+5n+12 


Step 10: 4n+8 


The remaining steps are cost-free. Thus, the total cost of 


this algorithm becomes: 


T(n) = 2T(n/2)+4C,,(n/2,2n+1)+4C,,(n/2,[n/2]+1)+ 


4C,,(n/2,n/2)+22n+31. 
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60 
Using classical polynomial arithmetic, this cost becomes: 


2T(n/2)+12n¢+66n+71 


24n°+66nlogn+47n-71 unit operations, 


Th) 


whereas uSing fast arithmetic, it becomes: 


Tne) 2T(n/2)*192nlogn+470nt 3 1 


96nlog“n+566nlogn+31n-31 unit operations. 


It 1S interesting to note that no polynomial or power series 
division is required by this algorithm. This accounts for 
its good performance when using fast arithmetic. To offset 
this, however, is the fact that A/B must be (n,n)-normal. 
Brent, Gustavson, and Yun [5] mention that this normality 
requirement can be relaxed somewhat. However, then the 
resulting rational function would not necessarily be in 
reduced form anymore. Furthermore, they indicate that to 
remove the normality condition entirely deteriorates the 
complexity of the algorithm. They do not give any further 
information about either claim so these points are not 
pursued any further. 

To compute the reduced (n,n) diagonal Pade approximant 
for any (n,n)-normal power series using this technique, all 
that is now required is an appropriate call to the MD2 
procedure. If A is an (n,n)-normal power series and on is an 


integer, this 1S accomplished as follows: 
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Algorithm 34222:)MD¢(Ps0,A,n9) 


Input: A is an (n,n)-normal power series, 
n 1S an integer. 
Output: P,Q such that P/Q is the reduced (n,n) Pade 


approximant for A. 
Comment M iS a matrix. 


1) Cal leMD2Z4MS At 00}. 


The validity of this algorithm is directly dependent on the 
correctness of the MD2 algorithm. The cost of the MD 


algorithm is identical to the cost of MD2. 


3.3 The DIAG Algorithm 

The next forward diagonal approach for computing 
diagonal Pade approximants is an iterative method developed 
by Cabay and Kao [6]. Given two consecutive diagonal Pade 
approximants, this technique computes the next one in the 
sequence based on the given two. Let A be an arbitrary power 
series truncated to 2n+1 terms. Then this approach computes 
the reduced (n,n) diagonal Pade approximant by the following 


algorithm: 
AL Qorethmes 35 4a DRAG (PHO, A, no) 


Input: A iS a power series truncated to 2n+1 terms, 


a aL fl. 

Gout Asc Thee 

ae aa | 

ay. | : g — 

H ay 

J tv = ay a 
L* ; 

Th : 2 ott 

| 

A ip - .- 

ic zits to yee iawe 

| _ 

i? Jo. 82 zona 


a 


~ 
am = —s | 
oD 
ee e 4 . = 
? - ba ’ ae — . 
t) 4 . ’ 2 4 meg ia fa 


a | 
} . 
3 ia 4 ic 
5 ~ a of 
—h 
, 2 Dad i 
. i re. ; Tae! r . x 
- . 
- c 4 a 
( 7 MS i o> 


ro 7 
_ | wo 


62 


n 1S an integer. 
Output: P,Q such that P/Q is the reduced (n,n) Pade 


approximant for A. 
Comment B,P, ,Q,,R, are power series for all k. 
P, P Ae 
Q5 Q_, 1 0 
2) R_,=1, k=0, j=-1, i=0. 


3) While i <n do begin 


4) qa = ordl | (AQ, )/x?** "| ]+i4t. 
a) Phtdesanethenabegqin 
6) Ry = L(AQ, 9 /x0"? | mod poe 
7) Piste LCA Maan Sl itiod x0 sg 
8 ) B= (Ry _,/R,) mod x? 2" !, 
cuca sean See Ee 
ke QO 4] Lx? 
10) Korea etiie 
end 
dete) j=.i 
t2} izd 
end. 
13) P=P., Q=Q,- 


The division in step 8 must be done using power series 
division. The division by the power of x in steps 4, 6, and 
7 must be done using polynomial division. All other 
calculations involving power series can be done using either 


power series or polynomial arithmetic. In the case of this 
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algorithm, polynomial arithmetic is more efficient. The 


proot ot this algorithm is by induction: 


Assume that P/Q, is the reduced (i,i) Pade 
approximant for A and that Dea rea is the 
reduced (4,3) Pade approximant for A, with i > 5. 


Assume also that ord(Q,) = ord(Q, _,) =n0r, 
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deg(Q,,,) = max[deg(Q, )+deg(B) ,deg(Q,_,)+d-j] 


lA 


d, 
ord(Q, , 4) = ord(Q,B) ='g0 

Since ord(Q,_,x 

Now, observe that Bae eer er ee = 

This can be verified by induction: 

a 2 * Sic we ee d-3 
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d+i 


This means that gcd(P X 


K+1'2k+1) | 
However, because ord(Q, ,,) = One Heel: gcd(P,.1,2,44)- 


Therefore, gcd(P ‘lan 


K+ Oke) 

Consequently, Sen le hese is the reduced (d,d) Pade 
approximant for A. 

The algorithm terminates when d 2 n+1. That this 
actually occurs is guaranteed by the fact that 
di =i. 

At fthatvpoint} P/Q, forms the reduced (n,n) Pade 


approximant for A. 


Therefore, as calculated by this algorithm, P/Q is the 
reduced (n,n) Pade approximant for A. 

Before performing the cost analysis for the DIAG 
algorithm, a few comments should be made regarding some of 
the calculations performed in this algorithm. Consider, 
first, the problem of computing the value for din step 4 of 


the algorithm. One way to obtain the value of 
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ordl | (AQ, )/x***")] is to actually compute the value of AQ, 
to 2n terms and then look for the first non-zero term past 
the (2i)th coefficient of the result. Another approach is to 
only computeccoeffacients 2141) 0s), dti of AQ, until the 
(d+i)th coefficient is non-zero or until d+ti > 2n. Using 
this latter scheme, step 4 of DIAG can be replaced with the 


following sequence of calculations: 


So eC= nt O= 2 te 


4b) While c = 0 and d < 2n do begin 


4c) ears hai 
4d) For m = 0 until i do begin 
4e) Ca= aoe ely sak 
end. 
end. 


46) If cPer0wthenid *=untitetse de=ndair 


where Cie denotes the mth coefficient of Q,- Ina Similar 
fashion, the value of Rie computed in step 6 of DIAG, can be 
Svrained by Constructing only, coer hiclents. dti,...«,co.0f the 


product AQ), which can be done by the following method: 


6a) FOt«m = d+i1 until 2d doxbegin 


6b) Caras OF 

6c) For s = 0 until i do begin 

6d) Ek m . "km. om=sak.s* 
end 
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coefficient of Ri and ages is the sth 


coefficient of Q,- Note that if the alternate technique for 


step 4 1s also used, 


then r 


Redes = c where c 18S the last 


coefficient calculated by the first technique and the value 


for m in step 6a, above, 


At the beginning 
at step 5 of the DIAG 

: itj 
Roy = [AQ )/ 79] 


previous execution of 


starts at#dait+ leratherwthanveri. 


of each execution of the loop starting 


algorithm, Ry contains the value 


mod xcrae. 
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as left over from the 


ehiselocp.athus) fatesteparLotethe 


algorithm, it is only necessary to perform any calculations 


Pinel <ad7t) ,caniwhich casetondy coefficients 2it1% ..cydt ] of 


Bhe Pproduct AQ, _ | are required. Consequently, step 7 of this 


algorithm can be replaced with: 


favmlisdty.-o2i Chenubegqin 


7b) Form = 22741 until d+7.do0 begin 


7c) Petes ale 


7a) For s = 0 until j do begin 
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end. 
end. 


end. 


where Epa pets the mth coefficient of Ry and Tk-15s 1s 


the sth coefficient of Q.-45 
The three techniques presented above are only valuable 


if they represent an improvement over the original methods 


for producing the same results. A careful examination of 
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these techniques shows that each computes a partial power 
series (or polynomial) product using classical power series 
multiplication. Thus, each is an immediate improvement, by 
necessity, over computing the complete product using 
classical arithmetic. That they are an improvement over 
computing the total product using fast power series 
multiplication results from the fact that, in the worst 
case, steps 4, 6, and 7 of DIAG are each executed n times. 
It is shown shortly that using these three techniques then 
results in a total algorithm complexity of O(n“). On the 
other hand, computing the total product using fast power 
series arithmetic results in an O(n“logn) algorithm. 
Therefore, computing only the partial products in this way 
represents a clear improvement over computing the total 
product uSing either form of power series arithmetic. It 
should be noted that a partial product can not be obtained 
uSing the fast power series multiplication technique without 
first computing the total product. Thus, no further gain can 
be obtained in that direction. 

For determining the cost of the DIAG algorithm, assume 
that the worst case situation exists, which again occurs 
when the input power series, A, is normal along the main 
diagonal. In that event, the following bounds hold on the 


sizes of various calculated values: 
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Then, assuming that the three improvements presented above 
are used, the cost of each non-trivial step of DIAG, 
excluding the loop starting at step 5, can be given as 


follows: 


Step 43 2k+2 
Steps: 2k+2 
Step. 70 


Step 8: Cy(1.d) 


\O 
NO 
©) 


Step (k,1)+4k+8 


This brings the total cost of the algorithm, as determined 
by the loop starting at step 5, to: 
1 


es 
3G py ake ae 
k=0 


[2c,,(k,1)+C,(1,1)+8k+12]. 
Using classical polynomial multiplication and power series 
division, this equation collapses to: 

ned 5 

Tn) =z (12k+20) s="6ner14n unit operations. 

k=0 
Cabay and Kao [6] derive the same bound for classical 
arithmetic usage without the normality assumption. Using the 
fast arithmetic algorithms, and assuming that n is an exact 


power of 2, the total cost becomes: 


nisi 
xX (12mlogm+10m+8k+162) 


Ton} t= 
k=0 
= 8n“logn+8n-+48nlogn+246n-48 unit operations 
where m = gllog(k+2)] | It 1S interesting to note that for 


this particular algorithm, use of the fast arithmetic 
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techniques actually results in a poorer complexity for the 


worst case situation than use of the classical methods. 


3.4 The OFFDIAG Algorithm 


The final algorithm considered for computing symbolic 
diagonal Pade approximants was developed by Choi [7] in an 


attempt to improve on the complexity of the technique used 


in the DIAG algorithm. The result is an algorithm with a 
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better asymptotic cost which can be used to compute the Pade 


approximants of a power series on or below the main diagonal 


of the Pade table. Let A be a power series truncated to 


m+n+1 terms where m and n are two integers such that m 2 


Then the OFFDIAG algorithm produces two Pade approximants 


for A along the diagonal of the Pade table for which 


nN. 


i-j] = m-n. One is the reduced (m,n) Pade approximant for A. 


The other is the last different Pade approximant for A to 


appear on the same diagonal just before the (m,n) Pade 


approximant. 


Algorithm 3.4.1: OFFDIAG(M,A,m,n,m,n) 


Input: A iS a power series truncated to mt+n+1 terms, 


m,n are integers such that m2n. 


Outputs -M 


P P 
| f where P/Q is the reduced (m,n) Pade 
QQ 
approximant for A, 
and P/Q is the (m-1,n-1) Pade 


approximant for A, 


m,n such that mtn=ord(AQ-P)-1 and m-n=m-n. 
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vi x ; 
Comment A ,R,R are power series, 


: oe : 
Mn WS) ae matrix. 


4) While 2* < 2n do begin 


5) ifene> 2} then m = 21h elsesm =4ncns 
6) R = Hag) /xmtnt ly hack. 
72) If R # 0 then begin 
8) d* = ord(R)+1. 
9) n= md. 
10) R= IR/xt 1, 
Se * x 

18) R = AG) jee te mod x™ 72 te 
125) A* = (R/R) mod men 1 
13) - Call OFFDIAG(M*,A*,m*,n’,m ,n). 

P P PP] iit) AO . 
14) . : : | ; cate | 
15) m=m+m, n=n+m. 

end 
16) ja =a 
end 


Note that the Pade approximant computed as P/Q is not 


necessarily in reduced form. As with the DIAG algorithm, the 
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division in step 12 of OFFDIAG must be done using power 
series division while the division by the power of x in 
steps 6, 10, and 11 must by done using polynomial division. 
All other power series operations can be done using either 
power series or polynomial arithmetic, although for step 14, 
the polynomial methods are more efficient. The proof of the 


OFFDIAG algorithm is very similar to that for DIAG: 


For notational convenience, replace step 14 with 
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Furthermore, ord(R) = ord(AQ-P)-m-n-d = 0. 


Similarly, ord(AQ-P) = m+n-1, ord(R) = 0, and 
Po) Ge 
m+n 


- - * 
AQ-P = (Rx? bs mod. xo # 
* 


Because ord(R) = ord(R) = 0, ord(A ) = 0. 


The inductive step of the proof proceeds as follows: 
a*+1% - g*4+1_# 


AQ'=P™ = (OP “Ox o*)-(pp*-px O7) 
* 
a GhO-P)P —Wag-Pyx>® |o® 
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Since | # 0, this implies that 
AQ'-P' = 0 mod ,, (mem) + (n+m" ) +1 
Also, deg(P') = max[deg(PP*) ,deg(PO*)+a*+1] 
< mtm’, 
deg(Q') = max[deg(QP*) ,deg(00*)+a*+1] 


- =x 
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Thus, P'/Q' is the (m+m’,ntm*) Pade approximant for A. 
a ~ ty - -k | = -% 
By a Similar argument, P'/Q' forms the (m+m -1,n+m -1) 
Pade approximant for A. 


By induction, ord(P') = min[Lord(A),ord(PQ*)+da*+1], as 
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erd(Pijt= min[ord(PP*) ,ord(PQ*)+da*+1] 


min[ord(P)+ord(P*),ord(PQ*)+d*+1] 
= min[ord(A)+ord(A’) ,ord(PQ’)+d*+1] 
= min[ord(A) ,ord(PQ*)+d*+1]. 
THUS.9 De particular: ord(P*) = 0 since ord(A*) = 0. 
Then. Ora (0)) = min[Lord(QP") ,ord(QQ*)+d*+1] =1 0. 
Now observe that P'Q'-Q'P' = Bx’ for some k = 0. 
Gi) Wit h AK ko 5 
Assuming that yPO-OP ="—x stand FP O1-0-P = )=x- for some 
h and j, this can be verified by induction: 
rh! 'pr * a = =k Fae aie 
P'O'=0'P' = [ (PP -x OVE) (OPv=x OMO8 ie 


a lie a eee 


[ (QP*-x 0*0) (pp*-x 62D 
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x? *|1o*pop*+pp*0*0-0 OPP -OP*0'P] 


-x2* 1 (pop*9*-oBp*O*) + 
(QPQ"P*-PQQ"P*) J 
= -x2* 11 (pgp*S*-QBP*S*)- 
(PQQ"P*-QPQ"P*) J 


x2°*1(p3-98) (P*O*-0*B*) 
= RS Aly ex heg aa 
= —x 

where k = ht+j+d*+1. 

This means that gcd(P',Q') | ak, 

However, because ord(Q') = 0, x } gcd(P',Q'). 

Merefore,mqed(P™,0') = 1. 

Consequently, P'/Q' is in reduced form. 


It is now a trivial task to verify that at some point, 


P'/Q' forms the (m,n) Pade approximant for A. 
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The details of the fact that this occurs when i 2 logn 


are, bereeroucno. [7.)% 


This then completes the verification for this algorithm. 

Like in the DIAG algorithm, the calculations performed 
in steps 6, 10, and 11 can be performed either as stated or 
by computing only a partial power series product using the 
classical power series multiplication technique. However, in 
this case, use of the second alternative results in a poorer 
complexity when fast power Pe eaanpiclicscian is used. 
This occurs because of the fact that the covering loop, 
Starting at step 4 of OFFDIAG, is only executed logn times 
whereas in DIAG, it 1S executed n times in the worst case 
Situation. Consequently, use of the partial multiplication 
scheme is not considered for the OFFDIAG algorithm. 

Assume now that the input power series, A, is 
(m,n)-normal, which represents the worst-case situation for 
the OFFDIAG algorithm. In that case, the following 
assertions hold about the sizes of various values before 


step 14 of the algorithm is executed: 


deg(P) =m deg(P*) = m* 
deg(Q) =n deg(Q*) = n* 
deg(P) = m-1 deg(P*) = m*-1 
deq(O)e=onas deg(Q*) = n*-1 
deg(R) = m +n* deg(A*) = m+n” 


deg(R) = m* +n 
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algorithm can be summarized as follows: 


Step 6: C,,(2m-+m+n ,n)+4m"+2m+3n+2 


Step 16: 2m* 
Step 11: C,(m +n =+m+n-1,n)+2m=+2n"+2m+3n 
Step 12: C.(m*tn*,m*+n*) 


Step =13:of(m in 


Step 14: 4C,(m,m")+4C,,(n,m")+8m™+2m+2n+8 


Since the major concern of this discussion is diagonal Pade 
approximants, let m =n, which then implies that m = n. 


Assume further that n iS an exact power of 2. Then, when 


i = 0) m= f Senen=iOtandem =* 1. while if i > 1, 


= * 5 eae | * iv] 


m=ne=m = 2 andiin. =/72 -1. This means, after a 


couple of minor modifications, that 


Tat.n = Cy(1,1)+2C,,(2,0)+8C,,(0,1)+26, and 


p(2') = (2,8 sO. ( 2erae emai zen | 2284 


Bent 2aiM, 2: 


- ly s16¢21)+10 


where i > Q. This brings the total cost of the OFFDIAG 


algorithm to: 


Tee = 2T(n/2)+C,(n-1,n-1)+2C,,(2n,n/2)+8C,,(n/2,n/2)+ 


16nt+C,(1,1)+2C,,(2,0)+8C,,(0,1)+36. 
Using classical arithmetic techniques, this cost becomes: 


T(r.) 2T(n/2)+9n-+42n+104 


18n-+42nlogn+160n-104 unit operations. 
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The fast arithmetic methods produce a total cost of: 


rn) 2T(n/2)+186nlognt+361n+498 


I 


93nlog’n+454nlogn+986n-498 unit operations. 


To use the technique of the OFFDIAG algorithm for 
computing the reduced (n,n) diagonal Pade approximant for an 
arbitrary power series, A, it iS now only necessary to 


invoke the following procedure: 
Algorithm 3.4.2: OFFDIAGnn(P,Q,A,n) 


Input: A 1S an arbitrary power series, 
n is an integer. 
Output: P,Q such that P/Q is the reduced (n,n) Pade 


approximant for A. 


Comment: Missa matrix, 


1,j] are integers. 


t) &@CabbsOFEDIAG(M,Ajn} nei?» 


The validity of this algorithm is guaranteed by the 
correctness of OFFDIAG. The cost of this procedure is 


identical to the cost of the OFFDIAG algorithm. 
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4, Practical Analysis 

In order to properly present an empirical analysis of 
the diagonal Pade algorithms, it is necessary to consider 
three practical aspects pertaining to empirical timing 
results. To begin with, some attention must be paid to the 
selection of the actual field used as the polynomial and 
power series coefficient domain. Next, some specifics of the 
implementation must be included. Finally, and most 
importantly, the timing results, and how they were obtained, 


must be presented and discussed. 


4.1 Selection of the Coefficient Field 

Up to this point, it has been assumed that the 
coefficients of all polynomials and power series lie in an 
arbitrary field. For a practical implementation, it is 
necessary to select a specific field as the coefficient 
domain. One of the more popular classes of fields used for 
this purpose are the finite fields, or Galois fields, 
consisting of the integers modulo a prime integer. For any 
given prime integer, p, such a field is often denoted by 
GF(p). The presentation of the nature and properties of such 
fields are outside the scope of the current discussion. 
Details can be found in any introductory text on number 
theory, for example, in Shockley [25]. 

Although adequate for many algebraic applications, the 
use of GF(p) for any arbitrary prime integer, p, is not 


completely sufficient for the problem at hand. The reason 
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for this is that the fast polynomial and power series 
arithmetic techniques require the use of nth roots of unity, 
for specific values of n. Verifying the existence of roots 
of unity, and finding them if they do exist, is a 
non-trivial task. Some assistance is provided by the 


following result: 
Theorem 4.1.1 


Let p be a prime integer. Then GF(p) has a primitive 


PEnRnOOEEOSEUNIty tion J (pat). 


The proof of this theorem can be found in Shockley [25]. 
Using this result in conjunction with the fact that n always 
has the form n = 2k for some k > 0, it becomes immediately 
apparent that the optimal choice for GF(p) is where p is a 


Fourier prime. 
Definitions. 132 


A prime integer, p, iS a Fourier prime if p = c2 +1 


for some c,m > 0. 


Some valuable work regarding the existence and practical use 
of Fourier primes has been done by J. Lipson. Details of his 
results can be found in Lipson [19]. The actual Fourier 
prime used in the present implementation will be given 
shortly: 

One item that has been largely ignored to this point is 


the problem of how to compute inverses within the 
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coefficient field. The major reason for this delay is that 
such calculations tend to be very dependent on the domain 
within which they are done. In GF(p), multiplicative 
inverses can be computed uSing a couple of different 
techniques. The better of the two, as determined in an 
analysis by Collins [8], is the extended Euclidean 
algorithm. Since gcd(b,p) = 1 for any non-zero value b in 
GF(p), the extended Euclidean algorithm can be used to 
compute, uSing standard integer division, values c and din 


GF(p) such that 
Ghat Op e= 51 => oD =~) mod p. 


Consequently, c is the multiplicative inverse of b in GF(p). 
A result attributed to G. Lame, as given by Horowitz and 
Sahni [14], states that the number of divisions required by 
the extended Euclidean algorithm in this case is no more 
than Slog, yb. This means that for any given p, the number of 
operations required to compute the inverse of a non-zero 


element in GF(p) is bounded by a constant. 


4.2 Implementation Details 

The algorithms described in the previous chapters have 
been implemented in an extended form of the AlgolW 
programming language using a slightly modified version of a 
compiler jointly developed by the University of Newcastle 
upon Tyne and the University of Michigan. A description of 


both the language and the compiler used can be found in 
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[24]. The implementation was accomplished on the MTS 
Operating system, a somewhat dated but still fairly accurate 
general description of which can be found in the article by 
Boettner and Alexander [3]. 

One of the features of the AlgolW language is that any 
integer product cannot have a result larger than 2147483647. 
Thus, if exact arithmetic is to be retained, no element of 
the GF(p) used as the coefficient field can be larger than 
46340. Since 40961 is the largest Fourier prime within this 
bound, GF(40961) becomes the optimal choice for use as the 
coefficient domain in AlgolW, and is the field so used in 
the present implementation. 

The largest value of n, where n is a power of 2, Such 
that an nth primitive root of unity exists in GF(40961) is 
n = 4096, which means that the fast polynomial and power 
series multiplication technique can be used to obtain 
products with a maximum degree of 4095. An empirical 
examination shows that 3 is an appropriate (4096)th 
primitive root of unity for this field. All other primitive 
roots of unity required for obtaining smaller results using 
fast arithmetic are simple integral powers of this given 
primitive root. To cut down on the affect of manipulating 
roots of unity on the execution times obtained, all powers 
of this primitive root are stored, in the present 
implementation, as constants in an array, so that obtaining 
any particular root of unity simply consists of determining 


the appropriate index into this array. 
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The primary method used for storing polynomials and 
power series is as a linked list of coefficients, using 
AlgolW's RECORD and REFERENCE facility, with no special 
distinction being made between dense and sparse polynomials 
and/or power series. To increase efficiency and to provide 
better control over the creation of such storage, AlgolW's 
dynamic RECORD allocation mechanism has been bypassed in 
favour of making direct system calls to obtain large blocks 
of storage which are then dispensed under greater control. 
Furthermore, because AlgolW has no storage reclamation 
facifimty seal Storage 1S returned by each procedure to a 
central pool after it is no longer needed so that it can be 
re-used in later calculations. 

When performing polynomial and power series arithmetic, 
the use of linked lists is inferior to a more directly 
accessible structure. Therefore, when performing any 
calculations, the coefficients of the polynomials or power 
series involved are transferred to an array structure until 
the operation is complete, after which the results are 
converted back into linked list form. 

In order to obtain execution time values that are as 
independent as possible of the storage mechanism used, two 
different time values are obtained by each of the programs. 
The first value consists of the complete execution time for 
each Pade procedure. The second value is the time spent in 
storage manipulation for each Pade procedure. This latter 


value includes the time spent for polynomial and power 
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series storage allocation and reclamation as well as the 
time for converting from linked list format to array format 
and vice-versa. In this way, a non-Storage specific time 
value can easily be determined. To prevent the time 
accumulation operation from interfering with the execution 
time values, each call to the standard AlgolW TIME procedure 
has been replaced with an SVC instruction (see Boettner and 
Alexander [3]) which returns the elapsed problem state time 


for the active task. 


4.3 Empirical Results 

In order to determine the practical efficiency of each 
of the four Pade algorithms, the implementation just 
described was used to obtain CPU time costs for each 
algorithm using e number of sample input power series. The 
programs were executed under the MTS operating system 
running on an Amdahl 470/V8 mainframe computer with a real 
memory space of 32 megabytes. The programs were executed 
with that execution as the only active task on the machine; 
no other user or system programs were operational at the 
time of execution. All of the background system tasks, such 
as the STAT job and LOADLEVL, were turned off in attempt to 
eliminate as much system interference as possible. To 
guarantee that there would be no interference in the results 
due to memory paging, the paging drum processor (PDP), which 
handles that particular system function, was also turned 


off, since a shortage of memory would not be a problem with 
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the available memory space. 

The sample set of power series used to obtain the 
empirical CPU costs consisted of 18 power series with 
coefficients generated at random from the coefficient field, 
GF(40961). For 10 of these power series, the degree of 
conversion (the value of n when computing an (n,n) diagonal 
Pade approximant) was also chosen at random from between 0 
and 250. The degree of conversion for the other 8 power 
series ranged from 250 to 2000 in increments of 250. 

Each of the 18 power seriesS was converted twice using 
each of the four Pade algorithms: once using classical 
polynomial and power series arithmetic and once using fast 
arithmetic, resulting in 8 total CPU time values and 8 
associated storage manipulation time values for each sample 
power series. The actual CPU time values for each power 
series were then obtained by subtracting the storage 
manipulation time from the total CPU time value for each 
execution, producing a total of 8 actual CPU time values for 
each sample power series. These actual execution times are 
plotted, with the CPU time as a function of the degree of 
conversion, in the graph in Figure 1. In this graph, the 
solid lines represent the costs using classical polynomial 
and power series arithmetic while the broken lines represent 
the costs using fast arithmetic. For the sake of 
completeness, the actual CPU time values obtained, as well 
as the times spent by each program in storage manipulation, 


can be found in Appendix B at the end of this presentation. 
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In Figure 2, a graph of the theoretical operation 
counts, plotted with the operation counts as a function of 
the degree of conversion, is also included to allow the 
making of comparisons. Solid lines are again used to 
indicate the use of classical arithmetic while broken lines 
again represent the use of fast arithmetic. A comparison of 
the two graphs indicate that, in terms of the general 
appearance of the graphs, there is little difference between 
the empirical behaviour and the theoretical behaviour of the 
four algorithms. A couple of noticable differences do exist 
between the two graphs, however, which can largely be 
attributed to two causes. 

To begin with, the theoretical cost graph, as plotted, 
assumes that the value of n in each of the ee equations 
corresponds to the degree of conversion. While this is true 
when uSing classical arithmetic, the value of n when using 
fast arithmetic is actually closer to the nearest power of 2 
greater than or equal to the degree of conversion. AS a 
result, the lines in the theoretical cost graph 
corresponding to the use of fast arithmetic should more 
properly appear as staircases, rather than as smooth curves, 
indicating that the empirical CPU time graph is more 
indicative of the true behaviour of the Pade algorithms when 
using fast polynomial and power series arithmetic. 

The second reason for differences between the two 
graphs is that there is an imbalance in the hidden costs, 


mostly in the form of procedure calls, that exist in the 
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implementation of the four algorithms. For example, in the 
implementation, each polynomial or power series operation 
requires a procedure call, the cost of which is not 
accounted for in any of the theoretical cost equations. 
Since the number of times that such costs are incurred is 
not the same for each of the algorithms, this has the effect 
of slightly altering the relations that exist between the 
four algorithms. In particular, it is for this reason that 
some of the crossover points in the empirical CPU time graph 
do not occur at the same place as they occur in the 
theoretical cost graph. 

Up to this point, very little has been said about the 
storage manipulation times obtained. The reason for this is 
that the storage manipulation times represent only the costs 
of converting a polynomial or power series from a linked 
list structure to an array structure and back again. These 
costs are not an intrinsic property of the various 
algorithms and they would not exist if polynomials and power 
series are only stored using arrays. These time values are 
included in Appendix B simply to give an idea of how the use 


of linked lists affects the implementation. 


4.4 Conclusions 

As stated in the introduction, the ability to compute 
Pade approximants for a power series is useful to the 
efficient manipulation of rational functions in a symbolic 


computer algebra system. Motivated by this fact, a study was 
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undertaken to compare the best methods currently available 
for computing diagonal Pade approximants. Both the 
theoretical cost graph and the empirical cost graph that 
resulted from this study point to a very definite conclusion 
concerning the relative efficiencies of the four Pade 
algorithms considered. 

It appears from both graphs that initially, for a 
substantial class of problems, the DIAG algorithm using 
classical polynomial and power series arithmetic is the best 
method for computing diagonal Pade approximants. Then, after 
the problem exceeds a certain size, the OFFDIAG algorithm 
uSing fast arithmetic takes over in this role. The other 
methods for computing diagonal Pade approximants are always 
inferior to at least one of these two approaches. 

The actual point at which the use of OFFDIAG becomes 
Superior to the use of DIAG varies depending on whether the 
empirical or the theoretical results are being used. In the 
case of the former, the crossover occurs at a degree of 
conversion of about 1700 while in the latter case, the 
degree of conversion for the crossover point is about 2700. 
For practical purposes, it is probably safest to use the 
empirical value as the crossover point in any implementation 
that combines the use of both methods (as proposed shortly) 
Since both graphs indicate that the cost of the DIAG 
algorithm using classical arithmetic takes a sharp rise at 
this point, making it undesirable to keep using that 


approach for too long a period of time. 
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An immediate consequence of the high value of the 
crossover point given above is that it raises some question 
as to the usefulness of the OFFDIAG algorithm using fast 
arithmetic. It is not known whether there are any 
applications currently in use that require Pade approximants 
large enough for the OFFDIAG algorithm to become effective. 
In the absence of this information, however, it is probably 
not safe to eliminate the use of the OFFDIAG algorithm in a 
general system for handling rational functions using 
truncated power series until the intended applications are 
better known. 

As a result of all this, it appears that the best 
approach to computing diagonal Pade approximants efficiently 
using any of the given four methods is to design an 
algorithm that takes advantage of the efficiencies of both 
the DIAG and the OFFDIAG techniques. Such an algorithm would 
initially employ the DIAG algorithm using classical 
arithmetic to compute consecutive Pade approximants along 
the main diagonal until a specific degree of conversion is 
reached. After this point, the OFFDIAG algorithm using fast 
arithmetic would be used to continue the process of 
computing approximants along the diagonal (although these 
would no longer be consecutive) until the desired one is 
reached. The recursive call made by the OFFDIAG procedure 
would not call OFFDIAG again immediately, as it currently 
does, but would instead call this combined procedure so that 


full advantage could be made of both algorithms in their 
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most efficient environment. Such a combined use of the DIAG 
and OFFDIAG algorithms would, quite naturally, be superior 
to the use of OFFDIAG only, and its crossover with the DIAG 
algorithm would be lower than the crossover of the DIAG and 
OFFDIAG algorithms. As a result, it becomes necessary to 
re-determine the point at which the combined algorithm 


becomes superior to the DIAG algorithm. 


4.5 Thoughts for Future Research 

The algorithms and results presented so far apply to 
power series and polynomials with coefficients that lie ina 
field. A substantial number of problems (perhaps even the 
majority) deal with polynomials and power series whose 
coefficients come from a more general algebraic system, most 
notably a unique factorization domain (ufd). It, therefore, 
becomes interesting to speculate as to whether these results 
can be extended to this larger class of polynomials and 
power series. 

One way in which all four algorithms can be adapted to 
compute Pade approximants for a power series whose 
coefficients lie in a unique factorization domain is to make 
use of modular techniques. Using this approach, the initial 
power series is first mapped onto several Galois fields 
defined within the ufd using prime elements selected from 
the ufd. (Of course, Fourier primes should be selected in 
order to retain the ability to use Fourier-based fast 


arithmetic within the resulting fields.) The power series to 
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rational function conversion operation is then performed 
over each of these fields using the desired Pade algorithm. 
The results produced by each of these conversions are then 
combined using the Chinese Remainder Algorithm, as described 
by Lipson [17,19], to form a single solution over the given 
unique factorization domain. 

The only costs introduced by the use of modular 
techniques is the cost of mapping the initial power series 
onto the selected Galois fields and the cost of the Chinese 
Remainder fee Om cata Since these costs are the same 
regardless of the method used to compute the Pade 
approximants, the relative efficiencies of the four Pade 
algorithms as applied to power series over a unique 
factorization domain using this approach remain the same as 
when the coefficient domain is a field. 

A major disadvantage to the use of the Chinese 
Remainder Algorithm is that solutions over a potentially 
large number of different Galois fields may be required to 
produce the desired Pade approximant over the ufd. As an 
alternative to the use of this algorithm, it appears that 
decoupled p-adic construction via the Hensel Algorithm, a 
description of which is given by Yun [26], may be useful for 
mapping Pade approximants from a Galois field back to a ufd 
in much the same way as it is applied to the problem of 
polynomial division over a ufd. The advantage of using the 
p-adic technique is that only a single Galois field is 


required to produce the desired Pade approximant over the 
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given unique factorization domain. As with the use of the 
Chinese Remainder Algorithm, the relative efficiencies of 
the four Pade algorithms when applied over a unique 
factorization domain using p-adic construction would remain 
the same as when performed over a field. 

A third, and perhaps the most obvious, way to extend 
the four given Pade algorithms to the. class of power series 
over a ufd is to perform all power series and polynomial 
arithmetic directly over the desired unique factorization 
domain rather than over a field. Unfortunately, this 
approach suffers from a few difficulties. To begin with, 
Since inverses of coefficients do not exist when the 
coefficient domain 1s a ufd, it is not possible to perform 
exact polynomial and power series division within this 
domain, making it necessary, instead, to use pseudo-division 
of polynomials and power series, as outlined, for example, 
by Knuth [16]. The problem with the use of pseudo-division 
is that it produces explosive coefficient growth in the 
results of calculations, a phenomenon well-known in the 
study of computer algebra. It 1S possible that the same 
technique used to ease this problem in the Subresultant PRS 
Algorithm for computing polynomial GCDs (see Knuth [16]) may 
also be used to reduce the rate of such coefficient growth 
in the Pade approximation algorithms, but this is not a 
certainty and requires further investigation. 

A second difficulty with performing polynomial and 


power series arithmetic directly over a ufd is that 
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primitive roots of unity do not exist in a ufd so that fast 
polynomial and power series multiplication and, 
consequently, division based on the Fourier transform can 
not be performed directly over a unique factorization 
domain. Furthermore, an alternative method for performing 
fast polynomial and power series arithmetic over a ufd does 
not currently exist. It may be possible to employ modular 
techniques, uSing the Chinese Remainder Algorithm, to 
produce the desired multiplication result over a ufd from 
products computed using fast multiplication over several 
Galois fields properly chosen from within the ufd, but it is 
unclear as to whether the superior order of complexity Be 
fast multiplication can be retained using this approach. 
Another possibility is that the polynomial or power series 
product can be computed over a single Galois field generated 
from within the ufd using a prime element that is at least 
twice as large as any coefficient in the result as produced 
over the ufd, bounds for which can easily be determined. If 
the proper residue class is used in this Galois field, then 
the product produced is the same as the result would be over 
the ufd. While the order of complexity for fast 
multiplication over a ufd remains the same using this 
approach as for over a field, it suffers from one difficulty 
in that, for the general case, where no practical maximum 
exists on the size of coefficients in the ufd, no single 
prime element is sufficient for producing any arbitrary 


product, making it necessary to continually find a new one, 
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along with a primitive root of unity for the field thus 
defined, for each multiplication problem, the cost of which 
sGnoertrivialt 

Another factor that needs to be considered when 
performing arithmetic directly over a ufd is that the cost 
of performing unit operations no longer remains a constant. 
Regardless of whether classical or fast arithmetic is being 
used, it inevitably becomes necessary, at a practical level, 
to perform indefinite precision arithmetic involving 
elements from the ufd since most unique factorization 
domains, such as the integers, contain a wealth of values 
that exceed the precision of most computers. Thus the cost 
of unit arithmetic in a ufd varies depending on the size of 
the elements being operated on. 

In a Slightly different context, a final question which 
should be raised is whether the results presented for the 
computation of Pade approximants for a power Series over a 
field can still be improved. The answer to this is unclear. 
At present, no known work is being done on either the AD or 
the MD approaches to solving the Pade approximation problem. 
However, some study is being done to see if the power series 
division operations required by the DIAG and OFFDIAG 
algorithms can be removed in a manner similar to the way in 
which they are avoided by the MD algorithm. If this is 
possible, the costs of these two algorithms should improve 
dramatically since these operations represent the single 


largest contributor to the cost of these two methods, 
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particularly when using fast polynomial and power series 
arithmetic. Initial results in this area by Choi [7] 
indicate that the divisions can be eliminated for computing 
‘diagonal Pade approximants for a power series that is normal 
along the mainMdiagonal of its Pade table. It will be 
interesting to see if this result can be extended to include 


the more general class of abnormal power series as well. 
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Appendix A: Determining Theoretical Cost Estimates 


A theoretical cost estimate for an algorithm is 
obtained by accumulating a count of the number of base 
Operations, expressed in terms of the input size, required 
to execute the algorithm. In most cases, only the order of 
magnitude of such an eStimate is required. However, in the 
present discussion, more accurate estimates are desired, 
making it necessary to have exact methods for counting unit 
operations. It is the purpose of this appendix to examine 
some of the techniques used for producing the cost estimates 
given in the current presentation. 

The accumulation of the exact number of operations 
required to execute a linear sequence of statements in an 
algorithm is a straightforward, and simple, counting 
exercise. However, two algorithmic constructs, the loop and 
recursion, require somewhat more elaborate techniques to 
obtain the operation count. 

To determine the total number of operations for 
executing a loop, it is necessary to be able to sum up a 
finite sequence of values. Normally, these sequences have a 
rigid fixed pattern to them. For the algorithms described in 


this presentation, four types of sequences occur. 
Theorem A.0.1 


Let a,n be arbitrary positive integers. Then 
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The correctness proofs for these four equations are simple 
inductive exercises. Two useful properties of such sums of 


sequences are given as follows: 


Theorem A.0.2 


Let f£(i),g(i) be two functions with i as their input 


and let c be a constant. Then 
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Again, the inductive proofs of the correctness of these 
equations are trivial. The results of these two theorems are 
all that is required for obtaining cost estimates for the 
iterative aspects of the algorithms in the present 
discussion. 

In handling the recursive aspects of any algorithms, it 
is necessary to be able to convert the resulting recurrence 
relation into a more iterative equation of the type given by 


the previous two theorems. Two types of recurrence relations 
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need to be solved when determining cost estimates for the 
algorithms in the current discussion. The general solution 
to the first type of recurrence relation can be attributed 


to Bentley, Haken, and Saxe [2]. 
Theorem A.0.3 


Assume Tin) = kT(n/2)+£(n) where n = a iet i a= logk, 
. log n A 
gin} = f(n)/ma, ana h(n) = 2 g(2°). Then 
11 


T(n) = n°[T(1)+h(n)]. 


Proof: 


The assertion holds by mathematical induction: 
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Therefore T(n) = 2°T(n/2)+£(n) 


2°T(n/2)+n°g(n). 


Note also that h(n) = h(n/2)+g(n). 


Assume T(n) = n°©[T(1)+h(n)]. 


Then T(2n) KP Cn thon 


2°T(n)+(2n)°g(2n) 


2° (n° [T(1)+h(n) ])+(2n)°g(2n) 


(2n)°[T(1)+h(n)+g(2n) ] 


(2n)°[T(1)+h(2n)]. 


Ove .D. 
Note that in the recurrence relation solved by this theorem, 
the value of n iS assumed to be an exact power of 2, a 
Situation that holds for all recurrence relations which 


occur in this presentation. The second recurrence relation 
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which needs to be solved is only required, in this 
presentation, by the cost equation for the OFFDIAG 


algorithm. 
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In this second case, the recurrence relation is converted 
into another recurrence relation of the first type which can 


then be solved using the technique given by Theorem A.0.3. 
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Appendix B: Empirical CPU Time Values 


For the sake of completeness, the actual CPU time 


values, obtained by Subtracting the storage manipulation 
times from the total CPU times for each algorithm, as well 
as the times spent by each algorithm in storage 

the four 


manipulation, are presented in this appendix. In 


tables that follow, the first column, denoted by n, 
corresponds to the degree of conversion for each particular 
the 


power series converted in the sample runs. Since 


generation and implications of these time values are already 
discussed in Chapter 4 of this presentation, there is little 
more that needs to be said about them, so they are now given 


without any further comment. 
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Table 2: CPU Times using Fast Arithmetic 
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Table 3: Storage Manipulation Times using Classical 


Arithmetic 
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Table 4: Storage Manipulation Times using Fast Arithmetic 
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